Exploring Key Factors Driving Parents to Immunize Their Children in India using Regression Analysis
by Mochammad Miftahul Fahmi (5972035)
Figure: Image of Immunization of a baby in India (Source: UNICEF).
WHAT WILL WE DO HERE?
This code is a continuation of the notebook here .
What is our agenda?
I will mainly use Pandas for data cleaning and reshaping,
Matplotlib for plot visualization,
Seaborn for spatial visualization. Lastly, because the research here focuses mainly on inference (finding the variables that most heavily affect the immunization level), statsmodels will be the primary tool for regression analysis; a train-test split will also be performed using sklearn.
Note: the goal of this research remains the same as the previous research: primarily the 'immunization level (number of immunizations done) among children aged 0-3 years in India'.
Cheers and Happy New Year! 🍻🥳
In this part, main activities are:
# Declaration of Useful Libraries
# For dataframe processing (shp, csv)
import pandas as pd
import geopandas as gpd
import numpy as np
import os
from libpysal.weights import Queen
from pysal.lib import weights
from pysal.explore import esda
from splot.esda import moran_scatterplot, lisa_cluster, plot_local_autocorrelation
# For regression and EDA
from scipy.stats import pearsonr
from scipy.stats import mode
# statsmodel
import statsmodels.api as sm
from statsmodels.regression.linear_model import OLS
from statsmodels.stats.outliers_influence import variance_inflation_factor
## sklearn
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
# For mapping geographical features
import osmnx as ox
import seaborn as sns
import contextily as ctx
import mapclassify
# For plotting
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.lines import Line2D
from shapely.geometry import Point
Read the CSV file using read_csv. The CSV was downloaded from the open-source Mission Antyodaya Village Facilities dataset in SHRUG (https://www.devdatalab.org/shrug_download/). Information can be found here .
The analysis will focus on the "Subdistrict" and "District" levels for several reasons (explained in the Reflection at the bottom of Part 01).
# Read CSV file, at desired level
initial_data = pd.read_csv("./data/shrug-antyodaya-csv/antyodaya_pc11subdist.csv")
# The content of the CSV will be explained on Part 02
# Check the data at a glimpse using head
initial_data.head()
| pc11_state_id | pc11_district_id | pc11_subdistrict_id | total_population | male_population | female_population | total_hhd | total_hhd_engaged_in_farm_activi | total_hhd_engaged_in_non_farm_ac | is_govt_seed_centre_available | ... | open_uncovered_drainage | open_kuchha_drainage | fpo | no_electricity | internal_pucca_road | livestock_ext_services | _mean_p_miss | _core_p_miss | _target_weight_share | _target_group_max_weight_share | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 41 | 225 | 778.0 | 407.0 | 371.0 | 155.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 1 | 12 | 245 | 1563 | 446.0 | 216.0 | 230.0 | 95.0 | 40.0 | 45.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.00000 | 0.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | 12 | 250 | 1626 | 603.0 | 253.0 | 350.0 | 129.0 | 110.0 | 3.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.45937 | 0.54063 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 12 | 253 | 1679 | 1380.0 | 828.0 | 552.0 | 228.0 | 120.0 | 70.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.00000 | 0.00000 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 4 | 21 | 371 | 2774 | 297.0 | 155.0 | 142.0 | 70.0 | 55.0 | 15.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.00000 | 1.00000 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
5 rows × 171 columns
# Review the number of records (row) and columns
print("Check the Number of Rows and Columns")
num_rows, num_columns = initial_data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns (variables): {num_columns}")
Check the Number of Rows and Columns Number of rows: 5480 Number of columns (variables): 171
Next, get the name of each column of the initial dataframe. This is handy because we need to know the exact titles of the columns we will use.
# Get the name of the column (column title)
column_names = list(initial_data.columns)
print("The name of the columns of the data:" )
for i, column_name in enumerate(column_names, start=1):
print(f"{i}. {column_name}")
The name of the columns of the data: 1. pc11_state_id 2. pc11_district_id 3. pc11_subdistrict_id 4. total_population 5. male_population 6. female_population 7. total_hhd 8. total_hhd_engaged_in_farm_activi 9. total_hhd_engaged_in_non_farm_ac 10. is_govt_seed_centre_available 11. availability_of_watershed_dev_pr 12. availability_of_rain_harvest_sys 13. availability_of_food_storage_war 14. availability_of_farm_gate_proces 15. availability_of_custom_hiring_ce 16. total_cultivable_area_in_hac 17. net_sown_area_in_hac 18. net_sown_area_kharif_in_hac 19. net_sown_area_rabi_in_hac 20. net_sown_area_other_in_hac 21. is_soil_testing_centre_available 22. is_fertilizer_shop_available 23. no_of_farmers_using_drip_sprinkl 24. area_irrigated_in_hac 25. total_unirrigated_land_area_in_h 26. availability_of_milk_routes 27. availability_of_poultry_dev_proj 28. availability_of_goatary_dev_proj 29. availability_of_pigery_developme 30. is_veterinary_hospital_available 31. availability_of_fish_community_p 32. availability_of_fish_farming 33. availability_of_aquaculture_ext_ 34. total_hhd_with_kuccha_wall_kucch 35. total_hhd_have_got_pmay_house 36. total_hhd_in_pmay_permanent_wait 37. total_hhd_got_benefit_under_stat 38. total_hhd_in_perm_waitlist_under 39. piped_water_fully_covered 40. is_village_connected_to_all_weat 41. public_transport 42. availability_of_railway_station 43. total_hhd_having_pmsbhgy_benefit 44. availability_of_elect_supply_to_ 45. availability_of_solor_wind_energ 46. total_hhd_having_solor_wind_ener 47. availability_of_panchayat_bhawan 48. csc 49. public_info_board 50. is_common_pastures_available 51. total_hhd_availing_pmuy_benefits 52. availability_of_public_library 53. recreational_centre 54. is_bank_available 55. is_bank_buss_correspondent_with_ 56. is_atm_available 57. total_hhd_availing_pmjdy_bank_ac 58. is_post_office_available 59. telephone_services 60. is_broadband_available 61. is_pds_available 62. total_hhd_having_bpl_cards 63. 
availability_of_primary_school 64. is_primary_school_with_electrici 65. is_primary_school_with_computer_ 66. is_primary_school_with_playgroun 67. is_primary_school_have_drinking_ 68. availability_of_mid_day_meal_sch 69. total_primary_school_students 70. total_primary_school_teachers 71. availability_of_middle_school 72. availability_of_high_school 73. availability_of_ssc_school 74. no_of_children_not_attending_sch 75. availability_of_govt_degree_coll 76. total_grd_and_pg_in_village 77. is_vocational_edu_centre_availab 78. total_trained_trainee_under_skil 79. availability_of_jan_aushadhi_ken 80. total_hhd_registered_under_pmjay 81. is_community_waste_disposal_syst 82. total_hhd_with_clean_energy 83. is_community_biogas_waste_recycl 84. is_aanganwadi_centre_available 85. is_early_childhood_edu_provided_ 86. total_childs_aged_0_to_3_years 87. total_childs_aged_0_to_3_years_r 88. total_childs_aged_3_to_6_years_r 89. total_childs_aged_0_to_3_years_i 90. total_childs_categorized_non_stu 91. total_anemic_pregnant_women 92. total_anemic_adolescent_girls 93. total_underweight_child_age_unde 94. total_male_child_age_bw_0_6 95. total_female_child_age_bw_0_6 96. total_minority_children_getting_ 97. total_minority_hh_provided_bank_ 98. no_of_physically_challenged_recv 99. total_hhd_with_more_than_2_child 100. availability_of_mother_child_hea 101. total_hhd_availing_pension_under 102. total_shg 103. total_hhd_mobilized_into_shg 104. total_no_of_shg_promoted 105. total_hhd_mobilized_into_pg 106. total_shg_accessed_bank_loans 107. is_bee_farming 108. is_sericulture 109. is_handloom 110. is_handicrafts 111. availability_of_community_forest 112. availability_of_minor_forest_pro 113. total_hhd_source_of_minor_forest 114. availability_of_cottage_small_sc 115. total_hhd_engaged_cottage_small_ 116. availability_of_adult_edu_centre 117. total_no_of_registered_children_ 118. total_no_of_children_0_to_6_year 119. total_no_of_pregnant_women 120. total_no_of_pregnant_women_recei 121. 
total_no_of_lactating_mothers 122. total_no_of_lactating_mothers_re 123. total_no_of_women_delivered_babi 124. total_no_of_children_in_icds_cas 125. total_no_of_young_anemic_childre 126. total_no_of_newly_born_children 127. total_no_of_newly_born_underweig 128. total_hhd_not_having_sanitary_la 129. total_no_of_eligible_beneficiari 130. total_no_of_beneficiaries_receiv 131. gp_total_no_of_eligible_benefici 132. gp_total_no_of_beneficiaries_rec 133. gp_total_hhd_eligible_under_nfsa 134. gp_total_hhd_receiving_food_grai 135. total_no_of_farmers_registered_u 136. total_no_of_farmers_subscribed_a 137. total_no_of_farmers 138. total_no_of_farmers_received_ben 139. total_no_farmers_adopted_organic 140. total_no_of_farmers_add_fert_in_ 141. total_no_of_elected_representati 142. total_no_of_elect_rep_oriented_u 143. total_no_of_elect_rep_undergone_ 144. total_approved_labour_budget_for 145. total_expenditure_approved_under 146. total_area_covered_under_irrigat 147. total_hhd_having_piped_water_con 148. pc11_land_area 149. canal 150. surface_water 151. ground_water 152. other_irrigation 153. any_primary_sch_toilet 154. mandi 155. regular_market 156. weekly_haat 157. phc 158. chc 159. sub_centre 160. closed_drainage 161. open_covered_drainage 162. open_uncovered_drainage 163. open_kuchha_drainage 164. fpo 165. no_electricity 166. internal_pucca_road 167. livestock_ext_services 168. _mean_p_miss 169. _core_p_miss 170. _target_weight_share 171. _target_group_max_weight_share
Reflection of Part 01
Some important reflection based on the decision of which data will be the main set:
A. The main dataset chosen
The main dataset will be SHRUG at the subdistrict level, for the following reasons: 1) To prevent bias: since the author (me) is not a local of India, space is easier to imagine at the subdistrict and district level because these are close to the definition of a big town (for example, Anjar subdistrict in Gujarat, which has an area of 1404.89 km2). This helps us understand amenities, as it is easier to picture health facilities in a town than in an entire state. 2) Regarding the data analysis process:
Hence, I will use the "Subdistrict"-level CSV as the initial main data. Furthermore, for the shapefile, I will also use the "District"-level SHP, because the "Subdistrict" shapefile produces 'many islands' during the creation of the Spatial Weight Matrix (this is part of the non-linear process).
The shrid-level CSV is not chosen because of data availability. As you can see in the SHRUG Documentation, the shrid level has 11% missing data, while the subdistrict and district levels have only 7% and 4% respectively. Subdistrict-level data may also be more readily available and accurate compared to finer levels such as villages and towns (shrid).
On the other hand, the state level is too broad. State-level data is frequently used for policy advocacy and reporting to higher authorities, but it lacks the nuance needed for an amenities case study.
However, the district level will still be analyzed, so that we can compare areas across districts. Also, some subdistrict geometries result in many unintended islands, so we will still need the gpkg at the district level.
B. Inclusivity and Representativeness 1) Number of missing data: as explained above, it is wiser to have less missing data in a dataset (assuming the number of represented districts is equal). The share of missing data at the shrid level is above 10%, higher than in the other available datasets. The missingness can also be examined in the SHRUG metadata (link: https://docs.devdatalab.org/SHRUG-Metadata/Mission%20Antyodaya%20Village%20Facilities%20%282020%29/Tables/antyodaya-metadata/)
2) The chosen main dataset: the subdistrict level contains no missing total_population data across its 5444 subdistricts and 613 districts. Based on reference (https://en.wikipedia.org/wiki/Administrative_divisions_of_India), the data covers 5444 out of 6057 subdistricts (89%, 2018 data) and 613 out of 808 districts (76%, 2022 data), which already exceeds 75% coverage of India.
3) When splitting the data into train and test sets, we use a STRATIFIED split so that every state in India is represented.
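A minimal sketch of that stratified split, using a toy frame in place of the SHRUG table: the `stratify=` argument of sklearn's `train_test_split` keeps each state's share identical across the two splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the SHRUG subdistrict table: 10 rows per state
df = pd.DataFrame({
    "pc11_state_id": [1] * 10 + [2] * 10 + [3] * 10,
    "total_shg": range(30),
})

# stratify= keeps each state's share identical in the train and test splits
train, test = train_test_split(
    df, test_size=0.2, stratify=df["pc11_state_id"], random_state=42
)
print(train["pc11_state_id"].value_counts().to_dict())  # 8 rows per state
print(test["pc11_state_id"].value_counts().to_dict())   # 2 rows per state
```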
C. Other Reflections using CDS Framework
In this part, main activities are:
Research Background
The COVID-19 pandemic reminded us of the importance of vaccines and immunization. Through immunization, the body develops an immune foundation, preventing severe illness and contributing to herd immunity. The earlier that foundation is built, the more we contribute to herd immunity; therefore, immunization for children is important. It safeguards against various potentially life-threatening diseases. Protecting this vulnerable age group is vital for long-term public health and the well-being of the community.
This research is essentially preliminary research to identify further factors driving the immunization level among children aged 0-3 years. As a starting point, several factors may influence the immunization level, mainly economic, social, and political factors. Niranjan (2018)[1] finds that women from SHGs (Self Help Groups; community organizations that support families and women financially) exposed to a health intervention were more likely to adopt healthier practices overall, including providing age-appropriate immunization.
Kumar (2009)[2] found that the educational level of women and their husbands, income level, and infrastructure affect women's participation in SHGs. In this research, we will assume these factors also affect the level of immunization.
Research Objective
This program is part of research to better understand the driving factors behind the level of immunization of children aged 0-3 years. Ultimately, this research will narrow down which variables are likely to influence the level of immunization (and which have the heavier impact on the outcome), allowing future research to focus on the important variables and thus support proper decision making.
Thus, in this research, we will focus on INFERENCE instead of prediction.
Scope of the Data
The SHRUG Mission Antyodaya Village Facilities data contains a large dataset, at the shrid, subdistrict, and district levels, of important factors related to facilities and socio-economic features of India, collected 2018-2020.
Some important sources:
[1] Saggurti, N., Atmavilas, Y., Porwal, A., Schooley, J., Das, R., Kande, N., Irani, L., & Hay, K. (2018). Effect of health intervention integration within women's self-help groups on collectivization and healthy practices around reproductive, maternal, neonatal and child health in rural India. PloS one, 13(8), e0202562. https://doi.org/10.1371/journal.pone.0202562
[2] Participation in Self-Help Group Activities and its Impacts: Evidence from South India. MPRA Paper No. 19943, posted 14 Jan 2010 (online at https://mpra.ub.uni-muenchen.de/19943/).
Hypothesis
These hypotheses are part of a non-linear process: at the start of this research, the number of candidate variables was much higher. The most relevant and highest-impact variables are captured in the hypotheses below:

H1: The number of graduates and post graduates (represents the education level) positively impacts the level of immunized children (age 0-3 years).
H2: The number of Self Help Groups (SHG) (represents community help) positively impacts the level of immunized children (age 0-3 years)
H3: The number of Integrated Child Development Services (ICDS) beneficiaries (represents subsidy/ economy) positively impacts the level of immunized children (age 0-3 years)
H4: The number of health facilities in an area (represents infrastructure) positively impacts the level of immunized children (age 0-3 years)
H5: The number of Pradhan Mantri Matru Vandana Yojana (PMMVY) beneficiaries (represents subsidy/ economy) positively impacts the level of immunized children (age 0-3 years)
H6: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization.
Hypothesis Explanation
H1: The number of graduates and post graduates (represents the education level of parents) positively impacts the level of immunized children (age 0-3 years). Education improves cognitive ability and science-consciousness about the fact that immunization strengthens the body's immune system (acknowledging that this is science-based evidence). It is expected that without prior knowledge of the benefits of immunization, people will be unwilling to immunize their children.
H2: The number of Self Help Groups (SHG) (represents community help) positively impacts the level of immunized children (age 0-3 years). Self Help Groups contribute to community well-being, potentially boosting child immunization rates (age 0-3) through shared resources, awareness, and support. It is expected that if a group of people immunize their children, the willingness of others to immunize will also increase.
H3: The number of Integrated Child Development Services (ICDS) beneficiaries (represents subsidy) positively impacts the level of immunized children (age 0-3 years). This hypothesis builds on the idea that subsidies offered through ICDS contribute to improved accessibility, affordability, and awareness of healthcare services, including immunization.
H4: The number of health facilities in an area (represents infrastructure) positively impacts the level of immunized children (age 0-3 years). This hypothesis arises because a well-developed healthcare infrastructure should facilitate increased immunization coverage. A higher number of health facilities per area likely provides better access to vaccination services, reducing barriers and encouraging parents to adhere to immunization schedules.
H5: The number of Pradhan Mantri Matru Vandana Yojana (PMMVY) beneficiaries (represents subsidy) positively impacts the level of immunized children (age 0-3 years). Pradhan Mantri Matru Vandana Yojana is a maternity benefit program run by the government of India, with much the same explanation as H3 above (immunization is part of the planned scheme for this program, as explained in https://wcd.nic.in/sites/default/files/PMMVY%20Scheme%20Implemetation%20Guidelines%20._0.pdf). With a push from a government programme, parents will be aware of immunization and will follow the recommendation after guidance.
H6: The level of immunized children (age 0-3 years) is affected by neighbouring areas that also have a certain level of immunization. The immunization practice in one area (subdistrict) may affect immunization behavior in nearby areas. This follows from the idea that communities with higher immunization rates create a positive norm that influences neighboring communities.
Hypothesis Testing Method
Summary of Methods
The process in this research proceeds step by step using the statistical tools below.
To improve inclusivity, all the data across India is used (after removal of unusable records), representing >80% of Indian subdistricts.
# Constructs or Variables
Variables that are investigated here and their explanation:

| Variables | Represented by | Meaning | Unit | Type |
|----------|----------|----------|----------|----------|
| The number of graduates and post graduates | total_grd_and_pg_in_village, summed up to subdistrict and district level | Attainment of higher education (represents awareness of parenting and health) | number, int | independent var |
| The number of SHGs | total_shg | SHG: Self Help Group, a bottom-up community organization (NGO) aiming to empower women financially and in leadership (parenting) | number, int | independent var |
| The number of ICDS beneficiaries (lactating women) | total_no_of_lactating_mothers_re | ICDS: Integrated Child Development Services | number, int | independent var |
| The number of PMMVY beneficiaries | total_no_of_eligible_beneficiari | Pradhan Mantri Matru Vandana Yojana (PMMVY), a Maternity Benefit Programme | number, int | independent var |
| The availability of Mother and Child Health facilities | availability_of_mother_child_hea | Available infrastructure that may be related to immunization of children | number, int | independent var |
| The number of immunized children | total_childs_aged_0_to_3_years_i | Number of children (age 0-3 yrs) with immunization | number, int | dependent var |
| The number of immunized children (per capita of children aged 0-3 years) | total_childs_aged_0_to_3_years_i divided by total_childs_aged_0_to_3_years | Percentage of children (age 0-3 yrs) with immunization; per capita allows comparison between districts (explanation in Parts 04 and 05) | % | dependent var, same as above but in a different unit |
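The per-capita dependent variable described above is a simple column division. A small sketch on made-up numbers (not SHRUG values); note that a subdistrict reporting zero children would produce 0/0 = NaN, which must be handled before regression.

```python
import pandas as pd

# Tiny stand-in frame; the second row has zero children, i.e. a 0/0 case
df = pd.DataFrame({
    "total_childs_aged_0_to_3_years": [200.0, 0.0, 500.0],
    "total_childs_aged_0_to_3_years_i": [150.0, 0.0, 400.0],
})

# Per-capita immunization: immunized count divided by the 0-3 population
df["immunized_per_capita"] = (
    df["total_childs_aged_0_to_3_years_i"] / df["total_childs_aged_0_to_3_years"]
)
print(df["immunized_per_capita"].tolist())  # 0/0 becomes NaN, flagged for cleaning
```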
In summary, the factors that might affect immunization of children under 3 years of age are:
Reasons of choosing the variables
Most of the selection of the variables in this preliminary research are based on these factors:
Explanation of each variable (nos. 1-4 and 6 are independent, no. 5 is dependent):
Inclusion benefit: in this research, we analyze most subdistricts across India. The price is computer memory and slow running time, yet the inference should be more accurate than taking a smaller case study.
Reflection
A. Possible Other Variables that will improve the OUTPUT of RESEARCH.
Factor: Education Level. Alternative variables:
Factor: Economy. Alternative variables:
Factor: Health Facilities. Alternative variables:
Time dimension:
A proper discussion of per-capita variables: some variables should be scaled down to per capita (so comparisons are fair). However, with this SHRUG data, doing so makes the regression and analysis more biased (the correlation decreases). If possible: communication with the Indian government, or direct per-capita measurement data that I can access.
B. Possible Rejecting Condition
The hypothesis explained above would be rejected in various condition:
H1 --> Rejected if the education level does not align (low correlation) with the immunization level of children aged 0-3 years in India.
H2 --> Rejected if the presence of SHGs does not demonstrate a statistically significant correlation with the immunization level of children age 0-3 years in India.
H3 --> Rejected if communities with a higher number of ICDS beneficiaries do not exhibit a statistically significant increase in the immunization level of children aged 0-3 years in India.
H4 --> Rejected if the presence of Health Facilities does not demonstrate a statistically significant correlation with the immunization level of children age 0-3 years in India.
H5 --> Rejected if communities with a higher number of PMMVY beneficiaries do not exhibit a statistically significant increase in the immunization level of children aged 0-3 years in India.
H6 --> Rejected if there is little spatial correlation between an area and its neighbours (e.g. many areas with high immunization levels surrounding areas with low immunization levels).
Note: besides the coefficient in the regression, statistical significance will be evaluated. This will be obtained not only from Pearson correlation, but also from linear regression.
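A sketch of this evaluation on synthetic data (variable names here are illustrative, not the SHRUG columns): `pearsonr` returns both the correlation and its p-value, so a hypothesis can be rejected when the p-value is large or the correlation is near zero.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=500)               # e.g. a stand-in predictor
y = 0.6 * x + rng.normal(size=500)     # an outcome genuinely related to x
z = rng.normal(size=500)               # an unrelated variable

r_xy, p_xy = pearsonr(x, y)            # strong correlation, tiny p-value
r_xz, p_xz = pearsonr(x, z)            # correlation near zero
print(f"related pair:   r = {r_xy:.2f}, p = {p_xy:.2g}")
print(f"unrelated pair: r = {r_xz:.2f}, p = {p_xz:.2g}")
```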
C. Possible Bias
Possible biases are explained below:
C.1 Possible cases in which the hypothesis will be falsely rejected
H2 --> Rejected, but it should not have been: SHGs are heterogeneous (in their dynamics) across communities with respect to immunization, and the analysis does not account for this variability.
H4 --> Rejected, but falsely, because certain areas are sparsely settled (more empty land) and thus have very few facilities!
C.2 Possible cases in which the hypothesis will be falsely accepted
H1 --> Accepted, but falsely, because we cannot be sure whether the degrees are science- or health-related. Also, having a degree does not necessarily mean the person has sufficient cognitive grasp of immunization.
H3 --> Accepted, but falsely, because of other government subsidies directly related to immunization that combine with it; for example, if only ICDS beneficiaries could receive the immunization subsidy.
D. Final Reflection and LIMITATION OF THE RESEARCH!!
Important perspective that I am not taking into account:
Some important Critical Data Science framework:
E. What I made differently this time (compared to previous assignment):
In this part, main activities are:
0. Initial Variables (before elimination throughout entire story)
As explained in Parts 01 and 02 above, I will tell more of the story about the broader variables that potentially affect the immunization level of children, which will later be narrowed down to the variables that statistically impact the dependent variable. This story is based on the knowledge of the author (Fahmi) after taking the Regression and ML courses by Mr Trivik.
Basic Variables:
Related to identity: 'pc11_state_id','pc11_district_id','pc11_subdistrict_id',
Primary basic variables:
Potentially Useful Predictors
Dependent Variable itself 'total_childs_aged_0_to_3_years','total_childs_aged_0_to_3_years_i',
1. Data Subsetting
selected_columns = ['pc11_state_id','pc11_district_id','pc11_subdistrict_id',
'total_population','male_population','female_population','total_no_of_registered_children_','total_hhd',
'total_childs_aged_0_to_3_years','total_childs_aged_0_to_3_years_i',
'total_no_of_lactating_mothers','total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv','total_no_of_eligible_beneficiari',
'total_shg',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'availability_of_mother_child_hea',
'availability_of_adult_edu_centre',
'total_no_of_elect_rep_oriented_u','public_info_board'
]
# Select only the columns in the 'selected_columns' array
subset_df = initial_data[selected_columns]
subset_df.describe()
| pc11_state_id | pc11_district_id | pc11_subdistrict_id | total_population | male_population | female_population | total_no_of_registered_children_ | total_hhd | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | ... | total_no_of_lactating_mothers_re | total_no_of_beneficiaries_receiv | total_no_of_eligible_beneficiari | total_shg | total_grd_and_pg_in_village | total_trained_trainee_under_skil | availability_of_mother_child_hea | availability_of_adult_edu_centre | total_no_of_elect_rep_oriented_u | public_info_board | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5480.000000 | 5480.000000 | 5480.000000 | 5.444000e+03 | 5.444000e+03 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | ... | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 | 5444.000000 |
| mean | 19.396533 | 367.731204 | 2948.070803 | 1.781704e+05 | 8.930056e+04 | 82027.793146 | 8894.053645 | 36956.871279 | 6709.910184 | 4607.438047 | ... | 1140.140314 | 499.866559 | 605.912946 | 1346.499592 | 6864.305910 | 1032.557170 | 0.466686 | 0.101931 | 411.095649 | 0.525757 |
| std | 8.557839 | 167.451590 | 1709.091099 | 1.739284e+05 | 8.844371e+04 | 78541.294238 | 9686.785642 | 31786.638222 | 7248.253735 | 4964.223688 | ... | 1259.746154 | 589.028709 | 697.970053 | 1116.374731 | 9194.612951 | 1278.405733 | 0.251253 | 0.147964 | 613.340890 | 0.274559 |
| min | 1.000000 | 1.000000 | 3.000000 | 3.221229e+01 | 1.870391e+01 | 13.508380 | 0.000000 | 6.234637 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 10.000000 | 231.000000 | 1429.750000 | 6.000286e+04 | 2.981878e+04 | 28914.750000 | 2867.255248 | 14795.653926 | 2037.750000 | 1538.221895 | ... | 379.878028 | 119.000000 | 154.711955 | 613.968092 | 1927.626776 | 276.832862 | 0.279317 | 0.005546 | 51.000000 | 0.315324 |
| 50% | 21.000000 | 380.000000 | 2940.500000 | 1.316506e+05 | 6.572600e+04 | 60971.320505 | 6102.355860 | 28478.677296 | 4664.592475 | 3142.372016 | ... | 768.572493 | 300.415780 | 376.551707 | 1103.861693 | 3934.841645 | 632.705066 | 0.448054 | 0.052353 | 198.109337 | 0.522764 |
| 75% | 28.000000 | 533.000000 | 4373.250000 | 2.353213e+05 | 1.159880e+05 | 107664.463758 | 11091.076314 | 49461.337567 | 8560.722341 | 5688.250000 | ... | 1401.113626 | 677.745768 | 809.000000 | 1806.286511 | 7907.854403 | 1345.000000 | 0.642838 | 0.135347 | 542.915193 | 0.744552 |
| max | 35.000000 | 640.000000 | 5924.000000 | 1.927979e+06 | 1.015549e+06 | 900343.899912 | 109745.000000 | 323759.000000 | 82890.000000 | 49923.746582 | ... | 12921.000000 | 7104.000000 | 8364.977439 | 15806.100780 | 91699.845468 | 20077.204134 | 1.000000 | 1.000000 | 12537.922184 | 1.000000 |
8 rows × 21 columns
2. Manage Missing Data
First, we need to see which index that are NaN.
subset_df.isna().sum()
pc11_state_id 0 pc11_district_id 0 pc11_subdistrict_id 0 total_population 36 male_population 36 female_population 36 total_no_of_registered_children_ 36 total_hhd 36 total_childs_aged_0_to_3_years 36 total_childs_aged_0_to_3_years_i 36 total_no_of_lactating_mothers 36 total_no_of_lactating_mothers_re 36 total_no_of_beneficiaries_receiv 36 total_no_of_eligible_beneficiari 36 total_shg 36 total_grd_and_pg_in_village 36 total_trained_trainee_under_skil 36 availability_of_mother_child_hea 36 availability_of_adult_edu_centre 36 total_no_of_elect_rep_oriented_u 36 public_info_board 36 dtype: int64
The NaN count is 36 (out of 5480 rows), which is about 0.66% of the total observations in each of the columns/variables. Thus, it is more efficient to delete the NaN records.
Here, I prefer to delete the NA/ NAN values, with some reasons:
# Remove NA value of point 1 and 2 above
columns_to_remove_na = ['total_shg', 'total_population','availability_of_adult_edu_centre','availability_of_mother_child_hea']
# Drop rows with NA values in those columns
subset_df = subset_df.dropna(subset=columns_to_remove_na)
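For comparison only (this notebook keeps the deletion approach), a sketch of deletion versus the common alternative of median imputation, on a toy frame standing in for subset_df:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value, standing in for subset_df
df = pd.DataFrame({"total_shg": [10.0, np.nan, 30.0, 40.0]})

# The approach taken in this notebook: drop rows with NA
dropped = df.dropna(subset=["total_shg"])

# The alternative (not used here): fill NA with the column median
imputed = df.fillna({"total_shg": df["total_shg"].median()})

print(len(dropped), imputed["total_shg"].tolist())
```

With only 0.66% of rows affected, deletion costs little data, whereas imputation would keep the rows at the price of an assumed value.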
Finally, check if there is more NAN value
# Recheck if there is NA value again
column_where_na = []
row_where_na = []
# Select all columns inside the subset_df DataFrame
all_columns = subset_df.columns
# The loop below checks each column for NA values; if any are found,
# the column name and row locations are appended to the empty lists above.
for col in all_columns:
if subset_df[col].isna().any():
na_loc = subset_df.index[subset_df[col].isna()].tolist()
print(f"{col}: NaN at index {na_loc}")
column_where_na.append(col)
row_where_na.extend(na_loc)
if not column_where_na:
print("No NaN values found. Great Job yay!")
else:
print("There are NaN values. Please check.")
print("Columns with NaN values:", column_where_na)
print("Rows with NaN values:", row_where_na)
No NaN values found. Great Job yay!
3. Initial Visualization
AT FIRST, we will focus on the absolute numbers (not per-capita numbers) of the variables to get a global overview.
Histograms will be used to see the profile of each variable, mainly to see how concentrated the data is.
# I defined which column that I will plot in histogram
columns_hist = [
'total_childs_aged_0_to_3_years_i',
'total_population',
'total_no_of_beneficiaries_receiv',
'total_no_of_lactating_mothers_re',
'total_shg',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'availability_of_adult_edu_centre',
'availability_of_mother_child_hea'
]
# 'public_info_board' is excluded from this histogram set
# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 5
# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 16))
axes = axes.flatten()
# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
    axes[i].hist(subset_df[column].dropna(), bins=50, color='blue', alpha=0.7)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')
    if column == 'total_grd_and_pg_in_village':
        axes[i].set_title('Number of Graduates and Postgrads')
        axes[i].set_xlim(0, 50000)  # Set x-limit from 0 to 50000
    if column == 'total_shg':
        axes[i].set_title('Number of SHGs')
        axes[i].set_xlim(0, 8000)  # Set x-limit from 0 to 8000
    if column == 'total_no_of_lactating_mothers_re':
        axes[i].set_title('Number of Lactating Mothers Receiving ICDS')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000
    if column == 'total_childs_aged_0_to_3_years_i':
        axes[i].set_title('Number of Children Age 0-3 Years Immunized')
        axes[i].set_xlim(0, 30000)  # Set x-limit from 0 to 30000
    if column == 'total_no_of_beneficiaries_receiv':
        axes[i].set_title('Number of PMMVY Beneficiaries')
        axes[i].set_xlim(0, 3000)  # Set x-limit from 0 to 3000
    if column == 'total_trained_trainee_under_skil':
        axes[i].set_title('Number of Trainees')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000
    if column == 'availability_of_adult_edu_centre':
        axes[i].set_title('Availability of Adult Edu Center')
    if column == 'total_no_of_elect_rep_oriented_u':
        axes[i].set_title('Number of Elected Representatives for Govt Program')
        axes[i].set_xlim(0, 4000)  # Set x-limit from 0 to 4000
    if column == 'public_info_board':
        axes[i].set_title('Availability of Public Information Board')
    if column == 'total_population':
        axes[i].set_title('Total Population in each of the Subdistricts')
    if column == 'availability_of_mother_child_hea':
        axes[i].set_title('Availability of Mother-Child Health Facil.')
# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()
Important: graphical-excellence refinements are applied at the end of each Part (not after each individual graph), so the reader can follow the story we are building here more conveniently.
As we can see above, especially the last two histograms, there are some points of attention:
The 'availability' variables, such as the availability of health facilities and education centres, are not categorical even though they should be. I have checked the raw SHRUG files at both the district and subdistrict level, and both show the same characteristic. This would lead to problems if we did not understand the idea behind each variable.
Luckily, it is stated on the SHRUG documentation (https://docs.devdatalab.org/SHRUG-Metadata/Mission%20Antyodaya%20Village%20Facilities%20%282020%29/antyodaya-metadata/#:~:text=refer%20to%20this-,note,-for%20more%20details):
"(they) aggregated binary data on availability of facilities (yes/no) using 2011 Census population weights. This gives the following interpretation for any variable: share of population within a location that has access to a facility. Example: Consider public_transport - whether a village has access to public transport. In the subdistrict-level data, the variable will take values between 0 and 1. A value of 0 for a subdistrict would mean that 0% of population within the subdistrict has access to public transport."
The problem is then: the SHRUG data is based on 2020 records, while the aggregation of the binary data uses 2011 Census weights. Luckily (again), we can make the data more accurate by augmenting it from external sources with the same context. The Indian Government publishes data under the same programme as our focus here: the Mission Antyodaya Survey Data, accessible from the Open Government Data (OGD) platform of India (https://data.gov.in/catalog/mission-antyodaya-survey-data?page=1). The data year is 2020, the same year as SHRUG's Mission Antyodaya Survey Data (and the only year available there).
We can also get the data from the website of the programme: https://missionantyodaya.nic.in/ma2020/rawData2020.html, which also documents each of the variables.
We can zoom in on the 'strange' variables to recheck whether our judgement holds. The 'strange' variables are:
'availability_of_adult_edu_centre',
'availability_of_mother_child_hea',
Then, looking at the documentation (download by clicking "Download Supportive Documents" from https://missionantyodaya.nic.in/ma2020/rawData2020.html):
Hence, our judgement from the last two histograms above, that the data are not actually categorical, is correct.
Then, what next?
The external data will be downloaded and merged into our main dataset based on subdistrict_id. The important next step, then, is the augmentation itself. Let's go!
4. Data Augmentation
There will be two data augmentations:
Step 01: Preparing the Raw Data from Open Government Data (OGD) India
# Location of where I put the data
folder_path = './data/ogd_downloadeddata_allindia'
# Because I found that not all states are available, my plan to use a for loop over 1.xlsx through 38.xlsx was cancelled
# Instead, to make it more flexible, we scan folder_path and build a list of all the Excel file names found there
excel_files = [file for file in os.listdir(folder_path) if file.endswith('.xlsx')]
# Initialize an empty DataFrame to store the combined data
combined_df = pd.DataFrame()
# Read all the Excel files and combine them inside combined_df
# Note: this took about 10 minutes on my machine. Good luck to us!
for file in excel_files:
    file_path = os.path.join(folder_path, file)
    df = pd.read_excel(file_path)
    combined_df = pd.concat([combined_df, df], ignore_index=True)
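A side note on performance: calling `pd.concat` inside the loop re-copies everything already read on each iteration. Collecting the frames in a list and concatenating once is the usual faster pattern; a sketch with in-memory stand-ins for the per-state Excel files:

```python
import pandas as pd

# Stand-ins for the per-state Excel files (made-up values)
frames = [
    pd.DataFrame({"state_code": [1], "total_population": [100]}),
    pd.DataFrame({"state_code": [2], "total_population": [250]}),
]

# One concat at the end instead of one per file
combined_df = pd.concat(frames, ignore_index=True)
print(len(combined_df))  # 2
```

With hundreds of files, replacing the loop body with `frames.append(pd.read_excel(file_path))` and a single final `pd.concat(frames, ignore_index=True)` can noticeably cut the runtime.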
Since we mainly need the OGD data for the availability columns ('availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre', and 'availability_of_public_information_board'), we will slice combined_df down to the relevant columns.
selected_columns = [
'state_code',
'state_name',
'district_code',
'district_name',
'sub_district_code',
'sub_district_name',
'total_childs_aged_0_to_3_years_immunized',
'total_population',
'availability_of_adult_edu_centre',
'availability_of_public_information_board',
'availability_of_mother_child_health_facilities'
]
ogd_data = combined_df[selected_columns].copy()
Then, let us check the unique value on those two important variables:
selected_columns = [
'availability_of_adult_edu_centre',
'availability_of_mother_child_health_facilities'
]
# Please note that some columns from the OGD data have different naming compared to SHRUG
for column in selected_columns:
    value_counts = ogd_data[column].value_counts()
    print(f"Value counts for '{column}':\n{value_counts}\n")
Value counts for 'availability_of_adult_edu_centre':
availability_of_adult_edu_centre
2.0      605522
1.0       42806
0.0           7
135.0         1
Name: count, dtype: int64

Value counts for 'availability_of_mother_child_health_facilities':
availability_of_mother_child_health_facilities
2      444216
1      204112
0           5
120         2
20          2
35          2
26          1
350         1
140         1
41          1
56          1
24          1
22          1
10          1
52          1
27          1
15          1
302         1
182         1
18          1
30          1
320         1
110         1
352         1
108         1
Name: count, dtype: int64
We can see that there are still some outliers in these columns. According to the documentation explained above, the only valid values are 1 and 2; yet 0 and 135 appear in availability_of_adult_edu_centre, and even 0, 120, 20, etc. in availability_of_mother_child_health_facilities.
Since their counts are very small (compared to the normal 1 and 2 values), we will delete these outlier or erroneous rows.
ogd_data = ogd_data[(ogd_data['availability_of_adult_edu_centre'] != 0) & (ogd_data['availability_of_adult_edu_centre'] != 135)]
ogd_data = ogd_data[(ogd_data['availability_of_mother_child_health_facilities'] >= 1) & (ogd_data['availability_of_mother_child_health_facilities'] <= 2)]
ogd_data.head()
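The two chained filters above can be written more compactly with `Series.isin`, keeping only the documented codes 1 and 2 in a single pass; a sketch on made-up rows (column names mirror the OGD naming):

```python
import pandas as pd

ogd_demo = pd.DataFrame({
    "availability_of_adult_edu_centre": [2.0, 1.0, 0.0, 135.0, 2.0],
    "availability_of_mother_child_health_facilities": [2, 1, 120, 2, 35],
})

valid = {1, 2}
mask = (
    ogd_demo["availability_of_adult_edu_centre"].isin(valid)
    & ogd_demo["availability_of_mother_child_health_facilities"].isin(valid)
)
ogd_demo = ogd_demo[mask]
print(len(ogd_demo))  # 2: only rows with valid codes in both columns survive
```

This form is also safer than range comparisons, because it excludes any non-integer code (e.g. 1.5) that a `>= 1` / `<= 2` filter would let through.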
| state_code | state_name | district_code | district_name | sub_district_code | sub_district_name | total_childs_aged_0_to_3_years_immunized | total_population | availability_of_adult_edu_centre | availability_of_public_information_board | availability_of_mother_child_health_facilities | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | WEST BENGAL | 316 | MALDAH | 2208 | Harischandrapur - I | 10 | 261.0 | 2.0 | 1 | 2 |
| 1 | 19 | WEST BENGAL | 316 | MALDAH | 2213 | Ratua - II | 0 | 1050.0 | 2.0 | 1 | 2 |
| 2 | 19 | WEST BENGAL | 312 | HOOGHLY | 2348 | Haripal | 51 | 4176.0 | 2.0 | 2 | 2 |
| 3 | 19 | WEST BENGAL | 318 | MEDINIPUR WEST | 2468 | Dantan - II | 16 | 269.0 | 2.0 | 1 | 1 |
| 4 | 19 | WEST BENGAL | 304 | 24 PARAGANAS SOUTH | 2414 | Budge Budge - II | 268 | 7800.0 | 2.0 | 2 | 1 |
NEXT:
As we can see above, each row is a village-level observation. That is why two rows can share the same district ('MALDAH' at index 0 and 1) while having different subdistricts. Village level is too granular for our research here; we only need district- and subdistrict-level data (recall our decision in Parts 00 and 01), not villages.
Thus, we need to pivot: choose a representative value so that each subdistrict has a single value of 'availability_of_adult_edu_centre' and 'availability_of_mother_child_health_facilities'.
A super important note on the logic behind choosing the representative value:
# Group by the relevant columns
grouped_data = ogd_data.groupby(['state_code', 'state_name', 'district_code', 'district_name', 'sub_district_code', 'sub_district_name'])
# Define aggregation functions
agg_functions = {
'total_childs_aged_0_to_3_years_immunized': 'sum',
'total_population': 'sum',
'availability_of_adult_edu_centre': lambda x: x.mode().iloc[0] if not x.mode().empty else None,
'availability_of_mother_child_health_facilities': lambda x: x.mode().iloc[0] if not x.mode().empty else None
}
# Apply aggregation functions to create the pivoted DataFrame
ogd_data_pivot = grouped_data.agg(agg_functions).reset_index()
ogd_data_pivot.head()
| state_code | state_name | district_code | district_name | sub_district_code | sub_district_name | total_childs_aged_0_to_3_years_immunized | total_population | availability_of_adult_edu_centre | availability_of_mother_child_health_facilities | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | JAMMU AND KASHMIR | 1 | ANANTNAG | 53 | Pahalgam | 2117 | 61296 | 2.0 | 2 |
| 1 | 1 | JAMMU AND KASHMIR | 1 | ANANTNAG | 54 | Bijbehara | 2146 | 112828 | 2.0 | 2 |
| 2 | 1 | JAMMU AND KASHMIR | 1 | ANANTNAG | 55 | Anantnag | 2054 | 96869 | 2.0 | 1 |
| 3 | 1 | JAMMU AND KASHMIR | 1 | ANANTNAG | 56 | Shangus | 2608 | 96970 | 2.0 | 2 |
| 4 | 1 | JAMMU AND KASHMIR | 1 | ANANTNAG | 57 | Kokernag | 3372 | 142602 | 2.0 | 2 |
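The mode-based lambda used in the aggregation can be sanity-checked on a toy series. Note that when there is a tie, `Series.mode()` returns the tied values sorted, so `iloc[0]` deterministically picks the smallest:

```python
import pandas as pd

# Toy availability codes for the villages of one subdistrict
s = pd.Series([2, 2, 1, 2, 1])

# Same logic as the agg_functions lambda above
rep = s.mode().iloc[0] if not s.mode().empty else None
print(rep)  # 2: the most common code among the villages
```

In other words, each subdistrict is represented by the availability code reported by the majority of its villages.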
Step 02: Combine SHP (GEOPACKAGE) and SHRUG
Why? Because the SHRUG files only contain code names for states, districts, and subdistricts. We need the real names so it is easier to validate the data at each step against reality (for example, by googling).
First, SUBDISTRICT SHRUG + GPKG
map_path = './data/shrug-pc11subdist-poly-gpkg/subdistrict.gpkg'
# Read directly into gdf; avoid naming the variable 'map', which would shadow the Python builtin
gdf = gpd.read_file(map_path)
The column 'pc11_subdistrict_id' is stored as object in gdf, while in subset_df (from Parts 01 and 02) it is stored as int. We will convert both to float, to make it easier to merge other data in the future: float is more flexible than int (for example, it can hold decimals and NaN).
gdf.pc11_subdistrict_id = gdf.pc11_subdistrict_id.astype(float)
subset_df.pc11_subdistrict_id = subset_df.pc11_subdistrict_id.astype(float)
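Why the cast matters: pandas will not match merge keys stored with incompatible dtypes (recent versions raise a ValueError when one side is object and the other numeric). A minimal sketch with made-up values, casting both sides to float first:

```python
import pandas as pd

left = pd.DataFrame({"pc11_subdistrict_id": ["3722", "3723"],
                     "name": ["lakhpat", "rapar"]})
right = pd.DataFrame({"pc11_subdistrict_id": [3722, 3723],
                      "pop": [62117, 201902]})

# Cast both key columns to a common dtype before merging
left["pc11_subdistrict_id"] = left["pc11_subdistrict_id"].astype(float)
right["pc11_subdistrict_id"] = right["pc11_subdistrict_id"].astype(float)

merged = left.merge(right, on="pc11_subdistrict_id")
print(len(merged))  # 2: every key matches after the cast
```

Without the casts, the string key "3722" and the integer key 3722 would never be treated as equal.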
# Name the merged variable as geo_pop
# geo_pop means data that contains population and its socioeconomic + geometry!!
geo_pop = gdf.merge(subset_df, on='pc11_subdistrict_id')
geo_pop.head(3)
| pc11_state_id_x | pc11_district_id_x | pc11_subdistrict_id | subdistrict_name | geometry | pc11_state_id_y | pc11_district_id_y | total_population | male_population | female_population | ... | total_no_of_lactating_mothers_re | total_no_of_beneficiaries_receiv | total_no_of_eligible_beneficiari | total_shg | total_grd_and_pg_in_village | total_trained_trainee_under_skil | availability_of_mother_child_hea | availability_of_adult_edu_centre | total_no_of_elect_rep_oriented_u | public_info_board | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468 | 3722.0 | Lakhpat | MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... | 24 | 468 | 62117.708950 | 32088.750366 | 30025.930963 | ... | 204.869050 | 350.194879 | 387.535543 | 566.165208 | 394.599993 | 239.182093 | 0.661402 | 0.461577 | 51.469564 | 0.926652 |
| 1 | 24 | 468 | 3723.0 | Rapar | MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... | 24 | 468 | 201902.906291 | 105105.320376 | 96796.567185 | ... | 861.844679 | 471.671497 | 645.874145 | 339.236735 | 2531.541403 | 1269.336253 | 0.873535 | 0.382873 | 397.304285 | 0.897813 |
| 2 | 24 | 468 | 3724.0 | Bhachau | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 24 | 468 | 176504.000000 | 92678.000000 | 83460.000000 | ... | 790.000000 | 450.000000 | 498.000000 | 234.000000 | 2577.000000 | 468.000000 | 0.623509 | 0.135301 | 274.000000 | 0.933517 |
3 rows × 25 columns
# Because pc11_state_id_x = pc11_state_id_y and pc11_district_id_x = pc11_district_id_y,
# we will drop the _y columns
# and rename pc11_state_id_x to pc11_state_id and pc11_district_id_x to pc11_district_id
geo_pop = geo_pop.drop(['pc11_state_id_y', 'pc11_district_id_y'], axis=1)
# Rename columns
geo_pop = geo_pop.rename(columns={'pc11_state_id_x': 'pc11_state_id', 'pc11_district_id_x': 'pc11_district_id'})
Second, DISTRICT SHRUG + GPKG
Why doing this:
mapdist_path = './data/shrug-pc11dist-poly-gpkg/district.gpkg'
map_dist = gpd.read_file(mapdist_path)
Since the GPKG data above is at the district level, while "subset_df" is at the subdistrict level, we need to create a pivot containing the sum of each column (total population, total children, etc.).
columns_to_sum = ['total_population', 'male_population', 'female_population', 'total_hhd',
'total_grd_and_pg_in_village', 'total_shg',
'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
'total_no_of_beneficiaries_receiv','total_no_of_eligible_beneficiari',
'total_trained_trainee_under_skil','total_no_of_elect_rep_oriented_u']
# Create a pivot table
pivot_table_df = subset_df.pivot_table(values=columns_to_sum, index='pc11_district_id', aggfunc='sum')
pivot_table_df.head(3)
| female_population | male_population | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | total_grd_and_pg_in_village | total_hhd | total_no_of_beneficiaries_receiv | total_no_of_elect_rep_oriented_u | total_no_of_eligible_beneficiari | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_population | total_shg | total_trained_trainee_under_skil | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pc11_district_id | ||||||||||||||
| 1 | 23523.138329 | 26510.879626 | 2761.074569 | 2590.694210 | 704.340235 | 8981.478930 | 168.859106 | 115.615244 | 206.890436 | 505.056064 | 479.194760 | 50041.624221 | 1.521253 | 109.530231 |
| 2 | 360076.838747 | 392652.138478 | 29465.549515 | 19313.974815 | 33116.408459 | 124419.391360 | 1013.957246 | 227.937998 | 1571.817218 | 4322.707964 | 3710.070887 | 755114.634253 | 1291.782192 | 1633.911806 |
| 3 | 45598.474752 | 47615.448440 | 2837.299473 | 2422.499434 | 3636.436542 | 16684.780804 | 242.444825 | 194.343043 | 314.018065 | 791.205308 | 756.308394 | 93219.355902 | 350.046346 | 191.291275 |
Merge the two datasets (GPKG and pivot_table) with the same steps as for the subdistricts above.
# Convert to astype
map_dist.pc11_district_id = map_dist.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)
pivot_table_df.index = pivot_table_df.index.astype(float)
# Next, merge using the indices
geo_pop_district = map_dist.merge(pivot_table_df, left_on='pc11_district_id', right_index=True)
geo_pop_district.head(3)
| pc11_state_id | pc11_district_id | district_name | geometry | female_population | male_population | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | total_grd_and_pg_in_village | total_hhd | total_no_of_beneficiaries_receiv | total_no_of_elect_rep_oriented_u | total_no_of_eligible_beneficiari | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_population | total_shg | total_trained_trainee_under_skil | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | Kachchh | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 6.918732e+05 | 7.659721e+05 | 52789.062307 | 36124.140136 | 29509.874064 | 332446.747616 | 3933.575088 | 2623.179898 | 4593.834795 | 7194.496554 | 6452.874545 | 1.465846e+06 | 5262.200653 | 9285.792694 |
| 1 | 24 | 469.0 | Banas Kantha | MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... | 1.490841e+06 | 1.603543e+06 | 137018.185689 | 93865.516320 | 103408.588403 | 607873.050328 | 9331.840111 | 8527.665757 | 11595.960114 | 19340.773617 | 16878.197941 | 3.273774e+06 | 13546.601715 | 24814.951196 |
| 2 | 24 | 470.0 | Patan | MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... | 5.204153e+05 | 5.651527e+05 | 49457.172519 | 30594.736482 | 64222.702638 | 258319.762048 | 1643.088208 | 3397.482485 | 2437.191204 | 9340.013570 | 7797.987739 | 1.089937e+06 | 14351.448633 | 30910.742259 |
Third, DISTRICT + SUBDISTRICT
As explained before, we will complete subset_df with the real district names so it is more intuitive.
# Convert the object to astype
geo_pop_district.pc11_district_id = geo_pop_district.pc11_district_id.astype(float)
geo_pop.pc11_district_id = geo_pop.pc11_district_id.astype(float)
# Next, merge using the indices
geo_pop = pd.merge(geo_pop, geo_pop_district[['pc11_district_id', 'district_name']], on='pc11_district_id', how='left')
# Also, do not forget to lowercase the strings
geo_pop['district_name'] = geo_pop['district_name'].str.lower()
geo_pop['subdistrict_name'] = geo_pop['subdistrict_name'].str.lower()
geo_pop.head(3)
# District name is in the rightmost column of the preview below.
| pc11_state_id | pc11_district_id | pc11_subdistrict_id | subdistrict_name | geometry | total_population | male_population | female_population | total_no_of_registered_children_ | total_hhd | ... | total_no_of_beneficiaries_receiv | total_no_of_eligible_beneficiari | total_shg | total_grd_and_pg_in_village | total_trained_trainee_under_skil | availability_of_mother_child_hea | availability_of_adult_edu_centre | total_no_of_elect_rep_oriented_u | public_info_board | district_name | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | 3722.0 | lakhpat | MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... | 62117.708950 | 32088.750366 | 30025.930963 | 4008.570778 | 12319.391613 | ... | 350.194879 | 387.535543 | 566.165208 | 394.599993 | 239.182093 | 0.661402 | 0.461577 | 51.469564 | 0.926652 | kachchh |
| 1 | 24 | 468.0 | 3723.0 | rapar | MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... | 201902.906291 | 105105.320376 | 96796.567185 | 9754.329549 | 42524.801920 | ... | 471.671497 | 645.874145 | 339.236735 | 2531.541403 | 1269.336253 | 0.873535 | 0.382873 | 397.304285 | 0.897813 | kachchh |
| 2 | 24 | 468.0 | 3724.0 | bhachau | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 176504.000000 | 92678.000000 | 83460.000000 | 9545.000000 | 41865.000000 | ... | 450.000000 | 498.000000 | 234.000000 | 2577.000000 | 468.000000 | 0.623509 | 0.135301 | 274.000000 | 0.933517 | kachchh |
3 rows × 24 columns
Step 03: Joining the Two Dataframes
The next step is to join the data from Step 01 (ogd_data_pivot, containing the availability data for Adult Education Centres and Health Facilities) and Step 02 (geo_pop, containing all the socioeconomic data). Both will be combined on "subdistrict_id".
Why do we use "subdistrict_id"?
Well, arriving at that choice was a long and non-linear process.
# First, we need to rename the old column of 'availability_of_mother_child_hea', 'availability_of_adult_edu_centre' in geo_pop
geo_pop.rename(columns={'availability_of_mother_child_hea' : 'availability_of_mother_child_hea_OLD'}, inplace=True)
geo_pop.rename(columns={'availability_of_adult_edu_centre' : 'availability_of_adult_edu_centre_OLD'}, inplace=True)
# or we can drop instead
#geo_pop.drop(['availability_of_mother_child_hea'], axis=1, inplace=True)
#geo_pop.drop(['availability_of_adult_edu_centre'], axis=1, inplace=True)
# Second, because subdistrict id column of ogd data is named 'sub_district_code' while it is named 'pc11_subdistrict_id' in geopop (notice the use of "_"),
# we will change the name of the ogd column
ogd_data_pivot.rename(columns={'sub_district_code':'pc11_subdistrict_id'},inplace=True)
# Third, merge dataframes based on 'subdistrict_name' and 'district_name'
dataframe_merged = pd.merge(geo_pop, ogd_data_pivot[['pc11_subdistrict_id','availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre']],
on=['pc11_subdistrict_id'], how='left')
dataframe_merged.head()
| pc11_state_id | pc11_district_id | pc11_subdistrict_id | subdistrict_name | geometry | total_population | male_population | female_population | total_no_of_registered_children_ | total_hhd | ... | total_shg | total_grd_and_pg_in_village | total_trained_trainee_under_skil | availability_of_mother_child_hea_OLD | availability_of_adult_edu_centre_OLD | total_no_of_elect_rep_oriented_u | public_info_board | district_name | availability_of_mother_child_health_facilities | availability_of_adult_edu_centre | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | 3722.0 | lakhpat | MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... | 62117.708950 | 32088.750366 | 30025.930963 | 4008.570778 | 12319.391613 | ... | 566.165208 | 394.599993 | 239.182093 | 0.661402 | 0.461577 | 51.469564 | 0.926652 | kachchh | 1.0 | 2.0 |
| 1 | 24 | 468.0 | 3723.0 | rapar | MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... | 201902.906291 | 105105.320376 | 96796.567185 | 9754.329549 | 42524.801920 | ... | 339.236735 | 2531.541403 | 1269.336253 | 0.873535 | 0.382873 | 397.304285 | 0.897813 | kachchh | 1.0 | 2.0 |
| 2 | 24 | 468.0 | 3724.0 | bhachau | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 176504.000000 | 92678.000000 | 83460.000000 | 9545.000000 | 41865.000000 | ... | 234.000000 | 2577.000000 | 468.000000 | 0.623509 | 0.135301 | 274.000000 | 0.933517 | kachchh | 2.0 | 2.0 |
| 3 | 24 | 468.0 | 3725.0 | anjar | POLYGON ((70.23631 23.40849, 70.23631 23.40545... | 150105.112093 | 79896.579777 | 70092.396031 | 7687.807328 | 34191.766772 | ... | 491.505351 | 3643.775958 | 1258.834380 | 0.830914 | 0.059282 | 239.531089 | 1.000000 | kachchh | 1.0 | 2.0 |
| 4 | 24 | 468.0 | 3726.0 | bhuj | POLYGON ((69.78433 23.99110, 69.79143 23.98750... | 238834.277866 | 119211.778880 | 113669.738389 | 9341.469566 | 54106.743999 | ... | 230.560895 | 2819.277177 | 1145.297841 | 0.486578 | 0.140952 | 233.778024 | 0.638809 | kachchh | 2.0 | 2.0 |
5 rows × 26 columns
# Check the result by counting NaN values in 'availability_of_mother_child_health_facilities' and 'availability_of_adult_edu_centre' columns
dataframe_merged[['availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre']].isna().sum()
availability_of_mother_child_health_facilities    33
availability_of_adult_edu_centre                  33
dtype: int64
We can delete the NaN rows above, because 33 is quite small (33/5454 ≈ 0.6%), which is negligible.
dataframe_merged.dropna(subset=['availability_of_mother_child_health_facilities', 'availability_of_adult_edu_centre'], inplace=True)
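As an alternative to counting NaNs after the fact, `DataFrame.merge` can report match quality directly via its `indicator` parameter; a minimal sketch on toy frames (names illustrative):

```python
import pandas as pd

geo = pd.DataFrame({"pc11_subdistrict_id": [1.0, 2.0, 3.0]})
ogd = pd.DataFrame({"pc11_subdistrict_id": [1.0, 2.0], "avail": [2, 1]})

# indicator=True adds a '_merge' column telling where each row came from
check = geo.merge(ogd, on="pc11_subdistrict_id", how="left", indicator=True)
print(check["_merge"].value_counts())
# 'left_only' rows are the subdistricts with no OGD match
```

The count of `left_only` rows corresponds exactly to the 33 NaN rows found above.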
The dataframe is now nearly ready for plotting, as we will see in the next steps!
Let us check again how each of the variables behaves.
# I defined which column that I will plot in histogram
columns_hist = [
'total_childs_aged_0_to_3_years_i',
'total_population',
'total_no_of_beneficiaries_receiv',
'total_no_of_lactating_mothers_re',
'total_shg',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'availability_of_mother_child_health_facilities',
'availability_of_adult_edu_centre'
]
# Then, I set the number of columns and rows for the layout
num_cols = 2
num_rows = 5
# Set up subplots for each column
fig, axes = plt.subplots(nrows=num_rows, ncols=num_cols, figsize=(10, 16))
axes = axes.flatten()
# Create histograms in each of the columns
for i, column in enumerate(columns_hist):
    axes[i].hist(dataframe_merged[column].dropna(), bins=75, color='blue', alpha=0.7)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Frequency')
    if column == 'total_grd_and_pg_in_village':
        axes[i].set_title('Number of Graduates and Postgrads')
        axes[i].set_xlim(0, 50000)  # Set x-limit from 0 to 50000
    if column == 'total_shg':
        axes[i].set_title('Number of SHGs')
        axes[i].set_xlim(0, 8000)  # Set x-limit from 0 to 8000
    if column == 'total_no_of_lactating_mothers_re':
        axes[i].set_title('Number of Lactating Mothers Receiving ICDS')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000
    if column == 'total_childs_aged_0_to_3_years_i':
        axes[i].set_title('Number of Children Age 0-3 Years Immunized')
        axes[i].set_xlim(0, 30000)  # Set x-limit from 0 to 30000
    if column == 'total_no_of_beneficiaries_receiv':
        axes[i].set_title('Number of PMMVY Beneficiaries')
        axes[i].set_xlim(0, 3000)  # Set x-limit from 0 to 3000
    if column == 'total_trained_trainee_under_skil':
        axes[i].set_title('Number of Trainees')
        axes[i].set_xlim(0, 10000)  # Set x-limit from 0 to 10000
    if column == 'availability_of_adult_edu_centre':
        axes[i].set_title('Availability of Adult Edu Center')
    if column == 'total_no_of_elect_rep_oriented_u':
        axes[i].set_title('Number of Elected Representatives for Govt Program')
        axes[i].set_xlim(0, 3000)  # Set x-limit from 0 to 3000
    if column == 'availability_of_mother_child_health_facilities':
        axes[i].set_title('Availability of Mother & Child Health Facil.')
    if column == 'total_population':
        axes[i].set_title('Total Population in each of the Subdistricts')
        axes[i].set_xlim(0, 1000000)  # Set x-limit from 0 to 1000000
# Add a single title for all subplots
fig.suptitle('Histogram of Socioeconomic Key Variables in India', fontsize=16)
plt.tight_layout()
plt.show()
We can see that the availability data is now fixed: it takes the values defined in the documentation, without the modification of the binary data using possibly outdated 2011 Census weights that the SHRUG documentation describes. We can then move on to the next steps!
Next, we do some calculations (adding new important variables). We need these so the analysis is more inclusive and fair: for example, the number of immunized children may be higher in a certain area simply because it has more children!
Part of Non Linear Process:
There are some interesting findings regarding treating per capita or rate variables:
Thus, depending on the purpose, the "rate of immunization" and the "absolute number of immunized children" can be used together or individually. Here, we will see how this choice affects the regression.
Let us continue first to make the "rate" variables.
List of new variables:
Note:
Reference: [1] Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513
The same reasoning also applies to the variables below: they will later be used as denominators in per-capita calculations, where a zero would produce 0/0 or NA. The analysis is not applicable if an area has no children, or even no mothers.
# Drop rows where specified columns contain only 0 for lactating mothers
columns_to_remove_zero = ['total_no_of_lactating_mothers']
dataframe_merged = dataframe_merged.loc[~(dataframe_merged[columns_to_remove_zero] == 0).all(axis=1)]
# Do the same for total child ages 0-3 yrs
columns_to_remove_zero = ['total_childs_aged_0_to_3_years']
dataframe_merged = dataframe_merged.loc[~(dataframe_merged[columns_to_remove_zero] == 0).all(axis=1)]
# And others:
columns_to_remove_zero = ['total_population']
dataframe_merged = dataframe_merged.loc[~(dataframe_merged[columns_to_remove_zero] == 0).all(axis=1)]
columns_to_remove_zero = ['total_no_of_eligible_beneficiari']
dataframe_merged = dataframe_merged.loc[~(dataframe_merged[columns_to_remove_zero] == 0).all(axis=1)]
# No 01
dataframe_merged['immunized_child_pct'] = (dataframe_merged['total_childs_aged_0_to_3_years_i'] / dataframe_merged['total_childs_aged_0_to_3_years']).round(3) * 100
# No 02
dataframe_merged['grd_rat'] = (dataframe_merged['total_grd_and_pg_in_village'] / dataframe_merged['total_population']).round(3) * 100
# No 03
dataframe_merged['shg_rat'] = (dataframe_merged['total_shg'] / dataframe_merged['total_population']).round(5) * 100
# No 04
dataframe_merged['icds_women_pct'] = (dataframe_merged['total_no_of_lactating_mothers_re'] / dataframe_merged['total_no_of_lactating_mothers']).round(3) * 100
# No 05
dataframe_merged['pmmvy_pct'] = (dataframe_merged['total_no_of_beneficiaries_receiv'] / dataframe_merged['total_no_of_eligible_beneficiari']).round(3) * 100
# No 06
dataframe_merged['trained_ratio'] = (dataframe_merged['total_trained_trainee_under_skil'] / dataframe_merged['total_population']).round(3) * 100
# No 07
dataframe_merged['trained_repr_rat'] = (dataframe_merged['total_no_of_elect_rep_oriented_u'] / dataframe_merged['total_population']).round(3) * 100
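The rate construction above can be sanity-checked on a toy frame (column names mirror the real ones; values are made up). Note that applying `.round(3)` before multiplying by 100 limits the resulting percentages to one decimal place:

```python
import pandas as pd

demo = pd.DataFrame({
    "total_childs_aged_0_to_3_years_i": [80, 30],
    "total_childs_aged_0_to_3_years": [100, 60],
})

# Same formula as for 'immunized_child_pct' above
demo["immunized_child_pct"] = (
    demo["total_childs_aged_0_to_3_years_i"]
    / demo["total_childs_aged_0_to_3_years"]
).round(3) * 100

print(demo["immunized_child_pct"].tolist())
```

For 80/100 and 30/60 this yields 80.0% and 50.0%, as expected; dividing first and rounding the ratio keeps the percentage bounded between 0 and 100 as long as the numerator never exceeds the denominator.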
Reflection
CDS Framework in Data Transforming
Technicality concerns:
Graphical excellence in this Part 03:
In this part, there are two main activities:
Part A: Regression Preparation. Activities:
Part B: Regression Process and Evaluation Activities:
Need to be evaluated in each of Regression:
Before going further, why do we need to do Regression (using OLS)?
Part A: Regression Preparation
01 - Outlier Check and Elimination
Let us first check the data types, to make sure the desired variables can be processed.
dataframe_merged.dtypes
pc11_state_id                                       object
pc11_district_id                                   float64
pc11_subdistrict_id                                float64
subdistrict_name                                    object
geometry                                          geometry
total_population                                   float64
male_population                                    float64
female_population                                  float64
total_no_of_registered_children_                   float64
total_hhd                                          float64
total_childs_aged_0_to_3_years                     float64
total_childs_aged_0_to_3_years_i                   float64
total_no_of_lactating_mothers                      float64
total_no_of_lactating_mothers_re                   float64
total_no_of_beneficiaries_receiv                   float64
total_no_of_eligible_beneficiari                   float64
total_shg                                          float64
total_grd_and_pg_in_village                        float64
total_trained_trainee_under_skil                   float64
availability_of_mother_child_hea_OLD               float64
availability_of_adult_edu_centre_OLD               float64
total_no_of_elect_rep_oriented_u                   float64
public_info_board                                  float64
district_name                                       object
availability_of_mother_child_health_facilities     float64
availability_of_adult_edu_centre                   float64
immunized_child_pct                                float64
grd_rat                                            float64
shg_rat                                            float64
icds_women_pct                                     float64
pmmvy_pct                                          float64
trained_ratio                                      float64
trained_repr_rat                                   float64
dtype: object
Next, let us see the scale in each of the variables.
# Specify the columns you want to create boxplots for
columns_name = [
'total_population',
'total_childs_aged_0_to_3_years_i',
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_no_of_elect_rep_oriented_u',
'total_trained_trainee_under_skil',
'total_shg',
]
# Specify the names for y-axis labels, so it can easily be read by you
columns_title = ['Total Population in Subdistricts','Childs (0-3) Immunized', 'Lact. Mothers with ICDS', 'PMMVY Benefic.', 'Hi Edu Level', 'Representatives', 'Under Training', 'SHGs']
# Create subplots
fig, axs = plt.subplots(ncols=len(columns_name), figsize=(20, 5))
# Plot boxplots for each selected column
for i, column in enumerate(columns_name):
sns.boxplot(y=dataframe_merged[column], ax=axs[i])
# Add line for 3 times the standard deviation
mean_value = dataframe_merged[column].mean()
std_dev = dataframe_merged[column].std()
threshold = mean_value + 3 * std_dev
axs[i].axhline(y=threshold, color='r', linestyle='--', label='3x SD')
# Set y-axis label
axs[i].set_ylabel(columns_title[i])
# Add a title to the entire set of subplots
plt.suptitle("Box Plot of Socioeconomics data related to Child Immunization in India", y=1.02, fontsize=16)
# Adjust layout
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=1.0)
plt.show()
We can see that, apart from Total Population, the independent variables are on roughly the same scale (thousands).
The red line above marks 3 standard deviations above the mean. A common practice for eliminating outliers after making such a boxplot is to drop values that lie beyond 3 standard deviations.
Under a Gaussian assumption, this 3x stdev band covers 99.7% of the data.
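That 99.7% figure can be checked directly from the normal CDF (a quick side calculation, not part of the notebook's pipeline):

```python
# Probability mass of a normal distribution within +/- 3 standard deviations.
from scipy.stats import norm

coverage = norm.cdf(3) - norm.cdf(-3)
print(round(coverage, 4))  # 0.9973
```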
Some sources related to this:
Let us see how many outliers each of the columns/variables contains:
for column in columns_name:
mean_value = dataframe_merged[column].mean()
std_dev = dataframe_merged[column].std()
# Calculate the threshold for outliers (beyond 3 times the standard deviation)
threshold = mean_value + 3 * std_dev
# Identify outliers
outliers = dataframe_merged[column][dataframe_merged[column] > threshold]
# Calculate the percentage of outliers
perc = len(outliers) * 100.0 / len(dataframe_merged)
print(f"Column {column} has outliers = {perc:.2f}%")
print(f"3x Standard Deviation value is: {threshold:.2f}\n")
Column total_population has outliers = 2.20%
3x Standard Deviation value is: 704494.28

Column total_childs_aged_0_to_3_years_i has outliers = 2.52%
3x Standard Deviation value is: 19668.88

Column total_no_of_lactating_mothers_re has outliers = 2.46%
3x Standard Deviation value is: 4963.21

Column total_no_of_beneficiaries_receiv has outliers = 1.93%
3x Standard Deviation value is: 2285.03

Column total_grd_and_pg_in_village has outliers = 2.59%
3x Standard Deviation value is: 34710.17

Column total_no_of_elect_rep_oriented_u has outliers = 1.69%
3x Standard Deviation value is: 2273.61

Column total_trained_trainee_under_skil has outliers = 1.91%
3x Standard Deviation value is: 4905.73

Column total_shg has outliers = 1.41%
3x Standard Deviation value is: 4731.48
As we can see above, the percentage is sizeable in each column (note: if you eliminate outliers column by column, the deleted rows accumulate; on the first run I lost ~10% of the entire dataset).
As a trade-off, we will only delete outliers in the columns tied to the control variables (the per-capita denominators): rows with a very large total population, total number of immunized children, or number of lactating mothers receiving ICDS.
Let us eliminate rows with total_population > 704494.28, then do the same for each of the other two columns (each against its own 3x st dev threshold).
dfreg_cut = dataframe_merged[dataframe_merged['total_population'] < 704494.28]
#dfreg_cut = dataframe_merged[dataframe_merged['total_population'] > 5000]
dfreg_cut = dfreg_cut[dfreg_cut['total_childs_aged_0_to_3_years_i'] < 19668.88]
dfreg_cut = dfreg_cut[dfreg_cut['total_no_of_lactating_mothers_re'] < 4963.21]
# How much is reduced?
dfreg_cut.shape
(5137, 33)
# previous data:
dataframe_merged.shape
(5327, 33)
Compared to the previous data size of 5327, the final size after outlier removal is 5137 (a reduction of about 3.6%). Let us move on, and come back here if further elimination is needed!
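The chained filtering above can be wrapped in a small reusable helper (a sketch; `trim_upper_outliers` is my own naming, not from the notebook). It computes every threshold on the original dataframe, so the result does not depend on the order in which columns are filtered:

```python
import numpy as np
import pandas as pd

def trim_upper_outliers(df, columns, n_sigma=3):
    """Drop rows where any listed column exceeds mean + n_sigma * std.

    All thresholds are computed on the original dataframe, so the
    result does not depend on the order of the columns.
    """
    keep = pd.Series(True, index=df.index)
    for col in columns:
        threshold = df[col].mean() + n_sigma * df[col].std()
        keep &= df[col] < threshold
    return df[keep]

# Demo on synthetic data with one planted extreme value:
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(100, 10, 1000)})
demo.loc[0, "a"] = 10_000
trimmed = trim_upper_outliers(demo, ["a"])
print(len(demo), "->", len(trimmed))  # 1000 -> 999
```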
02 - Correlation Mapping
We will use Correlation Mapping to show whether our chosen broad independent variables are good predictors. Since we cannot do this with the Availability (categorical) variables, we will only examine the numerical ones!
# Declaring which columns that we want to see the correlation
columns_corr = [
'total_childs_aged_0_to_3_years_i',
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_shg',
'total_no_of_elect_rep_oriented_u',
]
# Create a subset dataframe with only the specified columns as specified above
subset_corr_df = dfreg_cut[columns_corr]
# Change the name of the variable so that it is understandable by HUMAN!
subset_corr_df.columns = ['Childs (0-3) Immunized','Lact. Mothers with ICDS','PMMVY Benefic.','Hi Edu Level','Under Training','SHGs','No of Represent.']
# Create a pair plot
pair_plot = sns.pairplot(subset_corr_df, kind='scatter', diag_kind='hist')
# Change the colour to black (scatter plot)
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
sns.scatterplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, color='black', ax=pair_plot.axes[i, j])
# Add orange linear regression lines with confidence intervals
for i, j in zip(*np.triu_indices_from(pair_plot.axes, 1)):
sns.regplot(x=subset_corr_df.columns[j], y=subset_corr_df.columns[i], data=subset_corr_df, scatter=False, color='orange', ax=pair_plot.axes[i, j])
# Add title
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.02, fontsize=16)
# Show the plot
plt.show()
We can see that some of the independent variables move together with the dependent variable: the total (absolute) number of immunized children (0-3 years). We can see this better using one representative number below.
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Show the plot
plt.title('Heatmap of Socio-economic Variables related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()
Some important points that we get from the heatmap above:
Next, let us see the per capita variables that we have derived before!
# Define which variables we want to observe
columns_name = [
'immunized_child_pct',
'grd_rat',
'shg_rat',
'icds_women_pct',
'pmmvy_pct',
'trained_ratio',
'trained_repr_rat'
]
# Focus on dependent vs all independent variables
subset_corr_df = dfreg_cut[columns_name]
# Labels must follow the order of columns_name above (grd_rat, shg_rat, icds, pmmvy, ...)
subset_corr_df.columns = ['Childs (0-3) Immunized','Hi Edu Level','SHGs','Lact. Mothers with ICDS','PMMVY Benefic.','Under Training','Representatives']
# Create a pair plot with 'percent_immunized_child' on the y-axis and others on the x-axis
pair_plot = sns.pairplot(subset_corr_df, y_vars=['Childs (0-3) Immunized'], x_vars=subset_corr_df.columns[1:], kind='scatter', diag_kind='hist')
# Show the plot
pair_plot.fig.suptitle('Pair Plot of Key Variables regarding Immunization Level of Childs (0-3) in India', y=1.12, fontsize=12)
plt.show()
Hmm... this is not a nice-looking graph because the correlations seem unclear! What do the correlation coefficients say?
# Create a heatmap of the correlation matrix
correlation_matrix = subset_corr_df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=.5)
# Show the plot
plt.title('Heatmap of Socio-economic Variables (per capita) related to Immunization of Childs (0-3) in India', fontsize=12)
plt.show()
To make sure of the significance, we will also calculate p-values. A low p-value (typically under 5%) indicates a significant correlation between the variables. The null hypothesis for a correlation test typically assumes that there is no correlation between the variables.
# Calculate p-values for 'Childs (0-3) Immunized' vs other variables
p_values_list = []
for col in subset_corr_df.columns[1:]:
_, p_value = pearsonr(subset_corr_df['Childs (0-3) Immunized'], subset_corr_df[col])
p_values_list.append(('Childs (0-3) Immunized', col, p_value))
# Show it:
for p_value_info in p_values_list:
print(f"P-value between {p_value_info[0]} and {p_value_info[1]}: {p_value_info[2]:.3f}")
P-value between Childs (0-3) Immunized and Lact. Mothers with ICDS: 0.000
P-value between Childs (0-3) Immunized and PMMVY Benefic.: 0.000
P-value between Childs (0-3) Immunized and Hi Edu Level: 0.000
P-value between Childs (0-3) Immunized and Representatives: 0.000
P-value between Childs (0-3) Immunized and Under Training: 0.036
P-value between Childs (0-3) Immunized and SHGs: 0.000
Reflection
Based on the provided p-values:
Almost all of the chosen independent variables have low p-values (<5%) when tested against the Percentage of Children with Immunization.
These small p-values support the idea that there is a statistically significant correlation between the number of immunized children (age 0-3 years) and each of the independent variables (we reject the H0 of no correlation).
However, some independent variables are correlated with other independent variables! We will need to evaluate this in the next research because of the possibility of bias (for example, what we treat as independent variables might turn out to depend on one another). For the time being, let us evaluate these variables using regression methods.
Critical DS Framework Reflection
03. One Hot Encoding of Categorical Variables
We have two categorical variables: 'availability_of_adult_edu_centre', 'availability_of_mother_child_health_facilities',
As written at the documentation:
Since there are only two categories, One Hot Encoding is not strictly necessary. However, I ran into errors during the regression process because the default dtype of the '1' and '2' codes is not int. Thus, we still do One Hot Encoding --> convert to TRUE and FALSE --> convert to int --> run the regression.
Further, although One Hot Encoding is mainly for categorical variables with more than two classes, I will still include the process here because we might need it in the future (for instance, when a categorical variable has classes 1, 2, 3, etc.).
If we encode categories as 0, 1, and 2, the model might interpret it as an ordinal scale, assuming that the relationship between 0 and 1 is similar to that between 1 and 2. One-hot encoding eliminates this issue by creating binary vectors without any ordinal implication.
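A toy illustration of that point with `pd.get_dummies` (synthetic data, not our dataframe):

```python
import pandas as pd

# A 3-class code (1/2/3) becomes three binary indicator columns,
# so the model no longer sees an artificial 1 < 2 < 3 ordering.
toy = pd.DataFrame({"facility_type": [1, 2, 3, 1]})
encoded = pd.get_dummies(toy, columns=["facility_type"])
print(list(encoded.columns))
# ['facility_type_1', 'facility_type_2', 'facility_type_3']
```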
categorical_column = 'availability_of_mother_child_health_facilities'
# Performing one-hot encoding
dfreg_encoded = pd.get_dummies(dfreg_cut, columns=[categorical_column])
# If you do not need one hot encoding, just delete below code or add '#':
#dfreg_encoded = dfreg_cut
# Let us see the effect
dfreg_encoded.columns
Index(['pc11_state_id', 'pc11_district_id', 'pc11_subdistrict_id',
'subdistrict_name', 'geometry', 'total_population', 'male_population',
'female_population', 'total_no_of_registered_children_', 'total_hhd',
'total_childs_aged_0_to_3_years', 'total_childs_aged_0_to_3_years_i',
'total_no_of_lactating_mothers', 'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv', 'total_no_of_eligible_beneficiari',
'total_shg', 'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'availability_of_mother_child_hea_OLD',
'availability_of_adult_edu_centre_OLD',
'total_no_of_elect_rep_oriented_u', 'public_info_board',
'district_name', 'availability_of_adult_edu_centre',
'immunized_child_pct', 'grd_rat', 'shg_rat', 'icds_women_pct',
'pmmvy_pct', 'trained_ratio', 'trained_repr_rat',
'availability_of_mother_child_health_facilities_1.0',
'availability_of_mother_child_health_facilities_2.0'],
dtype='object')
dfreg_encoded.head(3)
| pc11_state_id | pc11_district_id | pc11_subdistrict_id | subdistrict_name | geometry | total_population | male_population | female_population | total_no_of_registered_children_ | total_hhd | ... | availability_of_adult_edu_centre | immunized_child_pct | grd_rat | shg_rat | icds_women_pct | pmmvy_pct | trained_ratio | trained_repr_rat | availability_of_mother_child_health_facilities_1.0 | availability_of_mother_child_health_facilities_2.0 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | 3722.0 | lakhpat | MULTIPOLYGON (((68.39398 23.44476, 68.39264 23... | 62117.708950 | 32088.750366 | 30025.930963 | 4008.570778 | 12319.391613 | ... | 2.0 | 85.0 | 0.6 | 0.911 | 90.2 | 90.4 | 0.4 | 0.1 | True | False |
| 1 | 24 | 468.0 | 3723.0 | rapar | MULTIPOLYGON (((70.53440 23.46489, 70.53222 23... | 201902.906291 | 105105.320376 | 96796.567185 | 9754.329549 | 42524.801920 | ... | 2.0 | 72.2 | 1.3 | 0.168 | 92.7 | 73.0 | 0.6 | 0.2 | True | False |
| 2 | 24 | 468.0 | 3724.0 | bhachau | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 176504.000000 | 92678.000000 | 83460.000000 | 9545.000000 | 41865.000000 | ... | 2.0 | 63.4 | 1.5 | 0.133 | 83.0 | 90.4 | 0.3 | 0.2 | False | True |
3 rows × 34 columns
From the output above, it is clear that there are two new variables: availability_of_mother_child_health_facilities_1.0 and availability_of_mother_child_health_facilities_2.0.
Next, we will rename '1.0' to YES and '2.0' to NO for easier reading.
# Rename the columns
dfreg_encoded = dfreg_encoded.rename(columns={'availability_of_mother_child_health_facilities_1.0' : 'avail_of_mc_health_facilities_YES', 'availability_of_mother_child_health_facilities_2.0' : 'avail_of_mc_health_facilities_NO'})
# And convert to INT so regression can do the job
dfreg_encoded.avail_of_mc_health_facilities_YES = dfreg_encoded.avail_of_mc_health_facilities_YES.astype(int)
dfreg_encoded.avail_of_mc_health_facilities_NO = dfreg_encoded.avail_of_mc_health_facilities_NO.astype(int)
# Do the same for availability_of_adult_edu_centre variables
categorical_column = 'availability_of_adult_edu_centre'
# Performing one-hot encoding
dfreg_encoded = pd.get_dummies(dfreg_encoded, columns=[categorical_column])
# Rename the columns
dfreg_encoded = dfreg_encoded.rename(columns={'availability_of_adult_edu_centre_1.0' : 'avail_of_adult_edu_centre_YES', 'availability_of_adult_edu_centre_2.0' : 'avail_of_adult_edu_centre_NO'})
# And convert to INT so regression can do the job
dfreg_encoded.avail_of_adult_edu_centre_YES = dfreg_encoded.avail_of_adult_edu_centre_YES.astype(int)
dfreg_encoded.avail_of_adult_edu_centre_NO = dfreg_encoded.avail_of_adult_edu_centre_NO.astype(int)
Notice that we do not drop the first dummy column, as is usually done when making dummy variables. Instead of deleting it, we can simply choose which variable to put into the linear regression, for the sake of practice and to keep the original variables.
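For reference, the conventional alternative (dropping one dummy to avoid the dummy-variable trap) is a single flag in `pd.get_dummies` — a toy sketch, not applied to our data:

```python
import pandas as pd

toy = pd.DataFrame({"avail": [1.0, 2.0, 1.0]})
# drop_first=True keeps only one indicator; the dropped level becomes the
# implicit baseline, avoiding perfect collinearity with the constant term.
encoded = pd.get_dummies(toy, columns=["avail"], drop_first=True)
print(list(encoded.columns))  # ['avail_2.0']
```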
04. Train Test Split
The data is split 80:20 into train and test because we need 20% of the data to measure whether our regression result can be applied to 'new cases'.
To make sure the training and test data appropriately represent each state (our data has one observation per subdistrict, so representation at the district or state level makes sense), it would be bad for the training data to entirely miss a state! Thus, we will use a stratified train/test split.
Note: one district in our data has only one subdistrict available, so we cannot stratify at the district level. Hence, we use the state level as the stratify parameter instead.
# This process uses the sklearn library
train_data, test_data = train_test_split(dfreg_encoded, test_size = 0.2, stratify=dfreg_encoded['pc11_state_id'])
# After processing:
train_data.shape, test_data.shape
((4109, 35), (1028, 35))
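A quick way to see that `stratify` preserves class shares in the split (a synthetic check, not our data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 80% of rows in state "A", 20% in state "B".
df = pd.DataFrame({"state": ["A"] * 80 + ["B"] * 20, "x": range(100)})
train, test = train_test_split(df, test_size=0.2,
                               stratify=df["state"], random_state=0)
print(train["state"].value_counts(normalize=True).round(2).to_dict())
# {'A': 0.8, 'B': 0.2}
```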
05. VIF to Detect Multicollinearity
Multicollinearity can cause significant problems during regression model fitting. Here, we will use one method of reducing it, the Variance Inflation Factor (VIF). A VIF is a measure of the amount of multicollinearity in a regression analysis; it quantifies the correlation of one independent variable with the group of other predictors.
VIF tells us the factor by which the correlations amongst the predictors inflate the variance. For example, a VIF of 10 indicates that the existing multicollinearity is inflating the variance of the coefficients 10 times compared to a no multicollinearity model.
VIFs do not have any upper limit. The lower the value the better.
Source: https://www.kaggle.com/code/marcinrutecki/multicollinearity-detection-and-remedies
Based on this interesting source (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6900425/): multicollinearity is present when the VIF is higher than 5 to 10, or the condition indices are higher than 10 to 30. Thus, there is no single fixed threshold, but it sits somewhere around 5 to 10.
# the independent variables:
X_var_vif = dfreg_encoded[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg']]
# VIF dataframe:
vif2_data = pd.DataFrame()
vif2_data["Feature"] = X_var_vif.columns
# calculating VIF for each feature
vif2_data["VIF"] = [variance_inflation_factor(X_var_vif.values, i)
for i in range(len(X_var_vif.columns))]
vif2_data
| Feature | VIF | |
|---|---|---|
| 0 | total_no_of_lactating_mothers_re | 5.839053 |
| 1 | total_no_of_beneficiaries_receiv | 4.274775 |
| 2 | total_grd_and_pg_in_village | 3.151885 |
| 3 | total_trained_trainee_under_skil | 2.575056 |
| 4 | total_no_of_elect_rep_oriented_u | 2.071100 |
| 5 | total_shg | 2.480037 |
Note: the categorical variables are excluded here because including them produced errors. This happens because VIF is based on the coefficient of determination (R^2) from regressing one predictor on the others, and the variance of a categorical variable makes (to me) less sense.
We can see from the code above that two variables have relatively moderate VIFs: total_no_of_lactating_mothers_re (total number of lactating mothers who receive the ICDS subsidy) and total_no_of_beneficiaries_receiv (total number of beneficiaries receiving the PMMVY subsidy). These are two programmes of India's Government, both of which impact health (including child immunization levels). However, I prefer to drop total_no_of_beneficiaries_receiv because, in the context of immunization level, PMMVY carries more bias than ICDS:
Thus, let us drop the 2nd variable and see the VIF.
# the independent variables:
X_var_vif = dfreg_encoded[[
'total_no_of_lactating_mothers_re',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg']]
# VIF dataframe:
vif2_data = pd.DataFrame()
vif2_data["Feature"] = X_var_vif.columns
# calculating VIF for each feature
vif2_data["VIF"] = [variance_inflation_factor(X_var_vif.values, i)
for i in range(len(X_var_vif.columns))]
vif2_data
| Feature | VIF | |
|---|---|---|
| 0 | total_no_of_lactating_mothers_re | 4.283787 |
| 1 | total_grd_and_pg_in_village | 3.148524 |
| 2 | total_trained_trainee_under_skil | 2.542942 |
| 3 | total_no_of_elect_rep_oriented_u | 1.902099 |
| 4 | total_shg | 2.470081 |
The VIFs of the other variables above do not change significantly. Hence, statistically, the multicollinearity is relatively low even if we do not exclude 'total_no_of_beneficiaries_receiv'.
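The drop-and-recompute cycle above can also be automated. Below is a sketch of an iterative VIF pruner (the helper name `prune_by_vif` and the threshold default are my own choices), demonstrated on synthetic collinear data rather than the notebook's dataframe:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def prune_by_vif(X, threshold=5.0):
    """Iteratively drop the feature with the highest VIF until all
    remaining VIFs fall below `threshold`."""
    X = X.copy()
    while X.shape[1] > 1:
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        )
        if vifs.max() < threshold:
            break
        X = X.drop(columns=vifs.idxmax())
    return X

# Demo: 'b' is almost a copy of 'a', so one of that pair gets dropped,
# while the independent column 'c' survives.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})
print(list(prune_by_vif(demo).columns))
```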
Part B: Regression Process
The regression process will be done in some activities:
Multivariate Regression
Multivariate regression is chosen because it lets us see how much each variable weighs on, or impacts, the dependent variable.
Let us first see the performance of multivariate regression on the non-per-capita (absolute) variables.
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg',
'avail_of_mc_health_facilities_YES',
'avail_of_adult_edu_centre_YES']].copy()
# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
model.summary()
| Dep. Variable: | total_childs_aged_0_to_3_years_i | R-squared: | 0.764 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.763 |
| Method: | Least Squares | F-statistic: | 1658. |
| Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 15:59:53 | Log-Likelihood: | -36244. |
| No. Observations: | 4109 | AIC: | 7.251e+04 |
| Df Residuals: | 4100 | BIC: | 7.256e+04 |
| Df Model: | 8 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 309.5529 | 48.995 | 6.318 | 0.000 | 213.496 | 405.610 |
| total_no_of_lactating_mothers_re | 2.9716 | 0.049 | 61.063 | 0.000 | 2.876 | 3.067 |
| total_no_of_beneficiaries_receiv | 0.2963 | 0.081 | 3.672 | 0.000 | 0.138 | 0.454 |
| total_grd_and_pg_in_village | 0.0599 | 0.005 | 12.090 | 0.000 | 0.050 | 0.070 |
| total_trained_trainee_under_skil | -0.0408 | 0.025 | -1.608 | 0.108 | -0.091 | 0.009 |
| total_no_of_elect_rep_oriented_u | -0.0839 | 0.056 | -1.491 | 0.136 | -0.194 | 0.026 |
| total_shg | 0.2446 | 0.026 | 9.501 | 0.000 | 0.194 | 0.295 |
| avail_of_mc_health_facilities_YES | -38.6552 | 57.448 | -0.673 | 0.501 | -151.284 | 73.973 |
| avail_of_adult_edu_centre_YES | 939.6405 | 186.221 | 5.046 | 0.000 | 574.547 | 1304.734 |
| Omnibus: | 1262.807 | Durbin-Watson: | 2.016 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 11077.002 |
| Skew: | 1.210 | Prob(JB): | 0.00 |
| Kurtosis: | 10.671 | Cond. No. | 6.86e+04 |
Well, as we can see above on OLS result:
In the meantime, it is better to drop the categorical variables. Reason: if you look at the histogram of availability_of_adult_edu_centre, more than 95% of subdistricts have no adult education centre; hence, it is essentially uninformative.
That leaves one categorical variable: availability of mother and child facilities. We will analyze how much it contributes to the prediction. Because its values are boolean, we need to scale the other variables; just for handy practice, we will temporarily use sklearn.
# We will use SKLEARN temporarily, because it provides easier Scaling (StandardScaler)
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Define the dependent variable (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
# Define the independent variables
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg',
'avail_of_mc_health_facilities_YES'
]].copy()
X_train, X_test, y_train, y_test = train_test_split(x_var, y_var, test_size=0.2, random_state=42)
# Standardize numerical variables
continuous_vars = [
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg'
]
scaler = StandardScaler()
X_train[continuous_vars] = scaler.fit_transform(X_train[continuous_vars])
X_test[continuous_vars] = scaler.transform(X_test[continuous_vars])
# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')
# Print the coefficients and intercept
coefficients = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': model.coef_})
print(coefficients)
print('Intercept:', model.intercept_)
Mean Squared Error: 2604582.619278291
R-squared: 0.7779923417120967
Variable Coefficient
0 total_no_of_lactating_mothers_re 2479.581092
1 total_no_of_beneficiaries_receiv 154.788593
2 total_grd_and_pg_in_village 463.233033
3 total_trained_trainee_under_skil -68.587941
4 total_no_of_elect_rep_oriented_u -48.137214
5 total_shg 219.514006
6 avail_of_mc_health_facilities_YES -6.462896
Intercept: 3997.539148183692
Looking at the result above, the availability of health facilities has a small coefficient, so we can basically ignore it. Let us first check the effect of dropping it on the MSE and R^2. After running the program below, the result is: the R^2 before and after dropping the categorical variable changes insignificantly. Thus, again, we can ignore this categorical variable and focus on the numerical ones.
X_train = X_train.drop('avail_of_mc_health_facilities_YES', axis=1)
X_test = X_test.drop('avail_of_mc_health_facilities_YES', axis=1)
# Linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions on the test set
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r_squared = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r_squared}')
# Print the coefficients and intercept
coefficients = pd.DataFrame({'Variable': X_train.columns, 'Coefficient': model.coef_})
print(coefficients)
print('Intercept:', model.intercept_)
Mean Squared Error: 2604258.90133897
R-squared: 0.7780199345636817
Variable Coefficient
0 total_no_of_lactating_mothers_re 2480.142199
1 total_no_of_beneficiaries_receiv 154.662329
2 total_grd_and_pg_in_village 462.742130
3 total_trained_trainee_under_skil -68.670872
4 total_no_of_elect_rep_oriented_u -47.721852
5 total_shg 219.359304
Intercept: 3995.5178957275484
Well, does that make our effort to combine India's Open Government Data (external) with our SHRUG data useless, then?
Of course not. Let us just enjoy the process! We would never "feel" the difficulty of linear regression with categorical variables if we never tried and failed at it, right?
Also, take a break in checking my assignment, Prof! Our journey is still halfway!
Thanks anyway for reading up until this, here is some icebreaking and what I feel when I add new data but then do not need it after some analysis:

Let us continue to do multivariate regression, this time without putting the Categorical Variables.
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
#y_var /= 1000.0
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg']].copy()
# Add a constant term to the independent variables matrix
#x_var /= 1000.0
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
model.summary()
| Dep. Variable: | total_childs_aged_0_to_3_years_i | R-squared: | 0.762 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.762 |
| Method: | Least Squares | F-statistic: | 2194. |
| Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 16:01:08 | Log-Likelihood: | -36257. |
| No. Observations: | 4109 | AIC: | 7.253e+04 |
| Df Residuals: | 4102 | BIC: | 7.257e+04 |
| Df Model: | 6 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 305.3273 | 45.429 | 6.721 | 0.000 | 216.262 | 394.393 |
| total_no_of_lactating_mothers_re | 2.9507 | 0.048 | 61.079 | 0.000 | 2.856 | 3.045 |
| total_no_of_beneficiaries_receiv | 0.3486 | 0.080 | 4.344 | 0.000 | 0.191 | 0.506 |
| total_grd_and_pg_in_village | 0.0613 | 0.005 | 12.448 | 0.000 | 0.052 | 0.071 |
| total_trained_trainee_under_skil | -0.0349 | 0.025 | -1.375 | 0.169 | -0.085 | 0.015 |
| total_no_of_elect_rep_oriented_u | -0.0973 | 0.056 | -1.735 | 0.083 | -0.207 | 0.013 |
| total_shg | 0.2439 | 0.026 | 9.464 | 0.000 | 0.193 | 0.294 |
| Omnibus: | 1255.672 | Durbin-Watson: | 2.019 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10904.492 |
| Skew: | 1.205 | Prob(JB): | 0.00 |
| Kurtosis: | 10.608 | Cond. No. | 1.67e+04 |
# Also calculate the RMSE
model.mse_resid**0.5
1645.336422386731
Result interpretation:
Note on the Condition Number:
Low (close to 1): indicates low multicollinearity; the independent variables are not highly correlated.
High (> 30): suggests moderate to severe multicollinearity. The interpretation of individual coefficients may be less stable, and the model could be sensitive to small changes in the data.
Very High (> 100): indicates a serious problem with multicollinearity. Parameter estimates become highly unstable, and the model may be unreliable.
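The condition number also depends on the units of the columns, not just their correlations, which matters later when we rescale the data. This can be demonstrated numerically with random data (an illustrative sketch, not the notebook's dataframe):

```python
import numpy as np

rng = np.random.default_rng(0)
# Design matrix: a constant column plus one feature measured in thousands.
X = np.column_stack([np.ones(500), rng.normal(5000, 1000, 500)])
X_scaled = X.copy()
X_scaled[:, 1] /= 1000.0  # same information, smaller units

cond_raw = np.linalg.cond(X)
cond_scaled = np.linalg.cond(X_scaled)
print(cond_raw > cond_scaled)  # True: rescaling alone shrinks the condition number
```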
To reduce the multicollinearity between independent variables, we will try to use the VIF (Variance inflation factors) that we have done before!
Recapping the VIF result, the variables that keep all VIFs < 5:
# Do regression after judgement using VIF
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg']].copy()
# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
condition_number = model.condition_number
condition_number
16622.94106327228
# Also calculate the RMSE
model.mse_resid**0.5
1648.9163346189816
Notice that before and after changing the independent variables using VIF, the RMSE barely changes (before: 1645, after: 1648)!
Both models still have a high condition number, indicating that the problem is not collinearity.
After some research, this inspired me (https://stats.stackexchange.com/questions/243000/cause-of-a-high-condition-number-in-a-python-statsmodels-regression). A possible explanation of the high condition number (>30): we are fitting a line to the points and then projecting that line all the way back to the origin (x=0) to find the y-intercept, and that y-intercept is very sensitive to small movements in the data points.
This probably happens because our numbers are in the thousands. To check this mathematically, we will reduce the scale by dividing all variables by 1000 (making the units 'in thousands').
# Divide by 1000 and see what happens
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i'] / 1000.0
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_trained_trainee_under_skil',
'total_no_of_elect_rep_oriented_u',
'total_shg']].copy() / 1000
# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
model.summary()
| Dep. Variable: | total_childs_aged_0_to_3_years_i | R-squared: | 0.757 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.756 |
| Method: | Least Squares | F-statistic: | 2128. |
| Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 15:36:06 | Log-Likelihood: | -7938.0 |
| No. Observations: | 4109 | AIC: | 1.589e+04 |
| Df Residuals: | 4102 | BIC: | 1.593e+04 |
| Df Model: | 6 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 0.2599 | 0.047 | 5.551 | 0.000 | 0.168 | 0.352 |
| total_no_of_lactating_mothers_re | 2.9804 | 0.050 | 59.275 | 0.000 | 2.882 | 3.079 |
| total_no_of_beneficiaries_receiv | 0.3893 | 0.082 | 4.728 | 0.000 | 0.228 | 0.551 |
| total_grd_and_pg_in_village | 0.0555 | 0.005 | 10.914 | 0.000 | 0.046 | 0.065 |
| total_trained_trainee_under_skil | -0.0205 | 0.027 | -0.765 | 0.444 | -0.073 | 0.032 |
| total_no_of_elect_rep_oriented_u | -0.1484 | 0.057 | -2.610 | 0.009 | -0.260 | -0.037 |
| total_shg | 0.2877 | 0.027 | 10.685 | 0.000 | 0.235 | 0.340 |
| Omnibus: | 1285.769 | Durbin-Watson: | 2.053 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10386.574 |
| Skew: | 1.262 | Prob(JB): | 0.00 |
| Kurtosis: | 10.368 | Cond. No. | 31.8 |
Notice that the R^2 is unchanged, but the condition number has dropped close to 30!
As long as we know the cause, we can temporarily ignore this warning (especially since it is not a multicollinearity problem).
Accordingly, we will not drop the variables we initially suspected of driving the collinearity. Instead, we will drop 'total_trained_trainee_under_skil' (total people under development training) and 'total_no_of_elect_rep_oriented_u' (number of elected representatives oriented under Rashtriya Gram Swaraj Abhiyan), for two reasons: the first is not significant (p = 0.444), and both carry small (even negative) coefficients.
Let us see the effect on the RMSE below.
# Dropping the high-p-value variables and less weight (coefficient)
# Define the dependent variables (output)
y_var = train_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'total_no_of_lactating_mothers_re',
'total_no_of_beneficiaries_receiv',
'total_grd_and_pg_in_village',
'total_shg']].copy()
# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
model.summary()
| Dep. Variable: | total_childs_aged_0_to_3_years_i | R-squared: | 0.756 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.756 |
| Method: | Least Squares | F-statistic: | 3184. |
| Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 0.00 |
| Time: | 15:36:06 | Log-Likelihood: | -36326. |
| No. Observations: | 4109 | AIC: | 7.266e+04 |
| Df Residuals: | 4104 | BIC: | 7.269e+04 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 248.6636 | 46.683 | 5.327 | 0.000 | 157.140 | 340.187 |
| total_no_of_lactating_mothers_re | 2.9632 | 0.050 | 59.317 | 0.000 | 2.865 | 3.061 |
| total_no_of_beneficiaries_receiv | 0.3168 | 0.078 | 4.041 | 0.000 | 0.163 | 0.470 |
| total_grd_and_pg_in_village | 0.0535 | 0.005 | 11.160 | 0.000 | 0.044 | 0.063 |
| total_shg | 0.2856 | 0.026 | 10.781 | 0.000 | 0.234 | 0.338 |
| Omnibus: | 1276.635 | Durbin-Watson: | 2.056 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 10319.017 |
| Skew: | 1.251 | Prob(JB): | 0.00 |
| Kurtosis: | 10.349 | Cond. No. | 1.66e+04 |
# Also calculate the RMSE
model.mse_resid**0.5
1672.8305560627273
Notice that the RMSE does not increase after eliminating the non-significant variables. This leaves the remaining variables, ranked by weight (coefficient), as the ones impacting the number of immunizations among children aged 0-3 years.
What about the per capita variables? Let us do the regression!
# Define the dependent variables (output)
y_var = train_data['immunized_child_pct']
# Define the broad independent variables that you want to narrow it down!
x_var = train_data[[
'icds_women_pct',
'pmmvy_pct',
'grd_rat',
'trained_ratio',
'trained_repr_rat',
'shg_rat']].copy()
# 'availability_of_mother_child_health_facilities'
# Add a constant term to the independent variables matrix
x_var = sm.add_constant(x_var)
# Fit the OLS model
model = sm.OLS(y_var, x_var).fit()
# Print the summary of the regression
model.summary()
| Dep. Variable: | immunized_child_pct | R-squared: | 0.222 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.220 |
| Method: | Least Squares | F-statistic: | 194.6 |
| Date: | Fri, 12 Jan 2024 | Prob (F-statistic): | 8.23e-219 |
| Time: | 15:36:06 | Log-Likelihood: | -16931. |
| No. Observations: | 4109 | AIC: | 3.388e+04 |
| Df Residuals: | 4102 | BIC: | 3.392e+04 |
| Df Model: | 6 | | |
| Covariance Type: | nonrobust | | |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 32.7339 | 1.462 | 22.389 | 0.000 | 29.868 | 35.600 |
| icds_women_pct | 0.4297 | 0.017 | 25.758 | 0.000 | 0.397 | 0.462 |
| pmmvy_pct | 0.0004 | 0.013 | 0.029 | 0.977 | -0.024 | 0.025 |
| grd_rat | 0.6184 | 0.083 | 7.432 | 0.000 | 0.455 | 0.782 |
| trained_ratio | -1.5371 | 0.374 | -4.113 | 0.000 | -2.270 | -0.804 |
| trained_repr_rat | -2.5610 | 0.790 | -3.243 | 0.001 | -4.109 | -1.013 |
| shg_rat | 2.2448 | 0.286 | 7.851 | 0.000 | 1.684 | 2.805 |
| Omnibus: | 257.458 | Durbin-Watson: | 1.998 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 344.522 |
| Skew: | -0.568 | Prob(JB): | 1.54e-75 |
| Kurtosis: | 3.850 | Cond. No. | 737. |
Notice that the R^2 is very low (0.22) compared with the non-per-capita variables earlier (0.76).
Below, we plot the immunization rate against the total population of children of the same age. Notice that the correlation is unclear:
Therefore, I recommend not using the immunization rate as the dependent variable for the independent variables we have chosen and narrowed down so far.
However, we will use the immunization rate as the dependent variable when comparing districts/subdistricts, which we will do in the next Part or Chapter. This aligns with the recommendation of a credible source [1], which states that per-capita measures are recommended under the following conditions:
It would be interesting to combine this rate (per-capita) variable with the principle from [1] and Tobler's First Law of Geography.
Reference: [1] Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513
sns.regplot(x='total_childs_aged_0_to_3_years', y='immunized_child_pct', data=dfreg_encoded,line_kws={'color': 'black'})
plt.title('Regression Plot: Total Children age 0-3 years vs Immunization Level')
plt.xlabel('Total Children Aged 0 to 3 Years')
plt.ylabel('Immunization Level (Percentage)')
plt.show()
Linear Regression
In this sub-part, we will focus on the narrowed-down independent variables against the dependent variable. Those are:
The coefficient shows the weight (impact) obtained from the multivariate regression.
Looking deeper also means evaluating the important variables:
1. Total number of children with immunization vs Total number of lactating mothers receiving ICDS subsidy
Here, we will use sklearn because it is simpler and is commonly used to compare train and test data.
# sklearn pieces used from here on (not imported in the setup cell above)
from sklearn.linear_model import LinearRegression
from sklearn import metrics
# Define the dependent variable (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that we want to narrow it down!
x_train = train_data[[
'total_no_of_lactating_mothers_re']].copy()
x_test = test_data[[
'total_no_of_lactating_mothers_re']].copy()
# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)
# Model making a prediction on test data
y_pred = lm.predict(x_test)
print('R2:', round( lm.score(x_test, y_test),4) )
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )
difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.7514
RMSE on training set: 1702.7484
RMSE on test set: 1697.1431
Difference in %: 0.329
When it comes to training models, there are two major problems one can encounter: overfitting and underfitting.
Overfitting: the model performs well on the training set but not on the test data. Underfitting: it performs well on neither the training set nor the test set.
In this prediction, the difference between the training and test RMSE is quite small (around 0-2%), so the fit is just right. Linear regression is appropriate here.
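The train/test RMSE comparison repeated throughout this section can be wrapped in a small helper. This is a sketch on synthetic data, and the helper name `rmse_gap` is my own invention:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_gap(model, x_train, y_train, x_test, y_test):
    """Return (train RMSE, test RMSE, relative gap in %) for a fitted model."""
    rmse_tr = np.sqrt(mean_squared_error(y_train, model.predict(x_train)))
    rmse_te = np.sqrt(mean_squared_error(y_test, model.predict(x_test)))
    return rmse_tr, rmse_te, abs(rmse_te - rmse_tr) / rmse_tr * 100

# Demo: a clean linear signal should show a small train/test gap
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=(1000, 1))
y = 2.5 * x.ravel() + rng.normal(scale=1.0, size=1000)
x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.25, random_state=42)

lm = LinearRegression().fit(x_tr, y_tr)
rmse_tr, rmse_te, gap = rmse_gap(lm, x_tr, y_tr, x_te, y_te)
print(f"train {rmse_tr:.3f}, test {rmse_te:.3f}, gap {gap:.2f}%")
```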
# Set blind-friendly color palette
sns.set_palette("colorblind")
# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)
# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)
# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')
# Add labels and title
plt.xlabel('Total Number of Lactating Mothers')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs Lactating Mothers under ICDS')
# Display legend
plt.legend()
# Display the plot
plt.show()
We can also use a machine learning tool called the Decision Tree Regressor, a regression method inside sklearn that lets us vary the depth of the decision tree.
More depth means a more complex model. As the graph below shows, the train and test curves do not converge as the depth increases: the deeper tree fits the training data ever more closely while the test error does not improve (it faces "unexpected x values"), a sign of overfitting at higher depths.
# Decision Tree Regressor (imports not in the setup cell above)
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11] # Try different depths
train_rmse = []
test_rmse = []
for depth in depths:
# Create and train the model
dt_regressor = DecisionTreeRegressor(max_depth=depth)
dt_regressor.fit(x_train, y_train)
# Predictions
y_train_pred = dt_regressor.predict(x_train)
y_test_pred = dt_regressor.predict(x_test)
# Calculate RMSE (Root Mean Squared Error)
train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))
# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()
Now do the same for other variables
2. Total number of children with immunization vs Total number of beneficiaries under PMMVY program
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that we want to narrow it down!
x_train = train_data[[
'total_no_of_beneficiaries_receiv']].copy()
x_test = test_data[[
'total_no_of_beneficiaries_receiv']].copy()
# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)
# Model making a prediction on test data
y_pred = lm.predict(x_test)
print('R2:', round( lm.score(x_test, y_test),4) )
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )
difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.3939
RMSE on training set: 2601.947
RMSE on test set: 2605.9737
Difference in %: 0.155
# Set blind-friendly color palette
sns.set_palette("colorblind")
# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)
# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)
# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')
# Add labels and title
plt.xlabel('Total Number of Beneficiaries')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs PMMVY Beneficiaries')
# Display legend
plt.legend()
# Display the plot
plt.show()
Compared with the previous variable, this one yields a noticeably larger RMSE (about 2600 vs 1700) and a lower R^2. Let us try a polynomial fit instead, starting with degree 2.
# Let us try Polynomial Regression!
degree = 2 # You can adjust the degree of the polynomial
poly_features = PolynomialFeatures(degree=degree, include_bias=False) # Set include_bias=False to avoid constant term
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]
# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)
# Do the poly regression yeah
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)
# Generate a DataFrame for x_range_poly
x_range = pd.DataFrame(np.linspace(x_test.values.min(), x_test.values.max(), 100), columns=['total_no_of_beneficiaries_receiv'])
x_range_poly = poly_features.transform(x_range)
x_range_poly_df = pd.DataFrame(x_range_poly, columns=poly_feature_names)
# Model making a prediction on x_range
y_pred = lm.predict(x_range_poly_df)
# Evaluation metrics
print('R2:', round(lm.score(x_test_poly_df, y_test), 4))
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
print('RMSE on training set:', round(rmse_train, 4))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
print('RMSE on test set:', round(rmse_test, 4))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
print('Difference in %:', round(difference_rmse, 3))
# Plot
plt.scatter(x_test.values, y_test, label='Test Data', alpha=0.7)
plt.scatter(x_train.values, y_train, label='Train Data', alpha=0.7)
plt.plot(x_range['total_no_of_beneficiaries_receiv'], y_pred, color='black', label='Polynomial Regression (Degree {})'.format(degree))
plt.xlabel('Total Number of Beneficiaries Received')
plt.ylabel('Immunized Children Aged 0-3 Years')
plt.title('Polynomial Regression with Degree {}'.format(degree))
plt.legend()
plt.show()
R2: 0.4008
RMSE on training set: 2581.2909
RMSE on test set: 2591.2985
Difference in %: 0.388
We can see that the R^2 is not significantly different, and the train vs test RMSE gap remains small. We can also see the effect of increasing the polynomial degree below.
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []
for degree in degrees:
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
# Polynomial feature names
feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]
# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)
# Polynomial regression
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)
# Evaluation metrics
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
# Append RMSE values
rmse_train_values.append(rmse_train)
rmse_test_values.append(rmse_test)
difference_rmse_values.append(difference_rmse)
# Plotting
plt.figure(figsize=(15, 5))
# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)
# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)
# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)
plt.tight_layout()
plt.show()
For the total number of children with immunization vs the total number of beneficiaries under the PMMVY program, the best fit is linear regression: even at higher polynomial degrees, the RMSE barely improves while the train/test difference stays small. I think this is because the data is too spread out. It might need more knowledge and tools (and brains) to overcome this properly.
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11] # Try different depths
train_rmse = []
test_rmse = []
for depth in depths:
# Create and train the model
dt_regressor = DecisionTreeRegressor(max_depth=depth)
dt_regressor.fit(x_train, y_train)
# Predictions
y_train_pred = dt_regressor.predict(x_train)
y_test_pred = dt_regressor.predict(x_test)
# Calculate RMSE (Root Mean Squared Error)
train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))
# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()
3. Total number of children with immunization vs Total number of community Self Help Groups
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that we want to narrow it down!
x_train = train_data[[
'total_shg']].copy()
x_test = test_data[[
'total_shg']].copy()
# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)
# Model making a prediction on test data
y_pred = lm.predict(x_test)
print('R2:', round( lm.score(x_test, y_test),4) )
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )
difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.1331
RMSE on training set: 3080.4105
RMSE on test set: 3116.7404
Difference in %: 1.179
# Set blind-friendly color palette
sns.set_palette("colorblind")
# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)
# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)
# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')
# Add labels and title
plt.xlabel('Total Number of SHGs')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs No of SHGs')
# Display legend
plt.legend()
# Display the plot
plt.show()
Turning to the polynomial degree and how the model performs, a lower degree is better for this relationship. The Decision Tree Regressor below also shows signs of underfitting.
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []
for degree in degrees:
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
# Polynomial feature names
feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]
# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)
# Polynomial regression
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)
# Evaluation metrics
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
# Append RMSE values
rmse_train_values.append(rmse_train)
rmse_test_values.append(rmse_test)
difference_rmse_values.append(difference_rmse)
# Plotting
plt.figure(figsize=(15, 5))
# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)
# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)
# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)
plt.tight_layout()
plt.show()
This one is trickier: the RMSE difference changes every time I re-run the train/test split. However, since the percentage stays low (around 2%), we can say the linear fit is still proper.
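Since the gap moves every time the split is redrawn, a stabler estimate is to fix `random_state` in `train_test_split`, or to average the RMSE over K folds. A sketch with `cross_val_score` on synthetic data:

```python
# K-fold cross-validation gives a split-independent RMSE estimate.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=(500, 1))
y = 1.8 * x.ravel() + rng.normal(scale=2.0, size=500)

# 5-fold CV: sklearn returns negated RMSE per fold, so flip the sign
scores = cross_val_score(LinearRegression(), x, y, cv=5,
                         scoring='neg_root_mean_squared_error')
rmse_cv = -scores.mean()
print(f"per-fold RMSE: {np.round(-scores, 3)}, mean: {rmse_cv:.3f}")
```

The mean over folds varies far less between runs than a single random split.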
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11] # Try different depths
train_rmse = []
test_rmse = []
for depth in depths:
# Create and train the model
dt_regressor = DecisionTreeRegressor(max_depth=depth)
dt_regressor.fit(x_train, y_train)
# Predictions
y_train_pred = dt_regressor.predict(x_train)
y_test_pred = dt_regressor.predict(x_test)
# Calculate RMSE (Root Mean Squared Error)
train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))
# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()
4. Total number of children with immunization vs number of graduates and post graduates
Last but not least,
# Define the dependent variables (output) here
y_train = train_data['total_childs_aged_0_to_3_years_i']
y_test = test_data['total_childs_aged_0_to_3_years_i']
# Define the broad independent variables that we want to narrow it down!
x_train = train_data[[
'total_grd_and_pg_in_village']].copy()
x_test = test_data[[
'total_grd_and_pg_in_village']].copy()
# Creating and training model
lm = LinearRegression()
lm.fit(x_train, y_train)
# Model making a prediction on test data
y_pred = lm.predict(x_test)
print('R2:', round( lm.score(x_test, y_test),4) )
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train)))
print('RMSE on training set:', round(rmse_train,4) )
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
print('RMSE on test set:', round(rmse_test,4) )
difference_rmse = abs(rmse_test-rmse_train)/rmse_train * 100
print('Difference in %:', round(difference_rmse,3) )
R2: 0.365
RMSE on training set: 2726.9544
RMSE on test set: 2667.5487
Difference in %: 2.178
# Set blind-friendly color palette
sns.set_palette("colorblind")
# Scatter plot for train data
plt.scatter(x_train, y_train, label='Train', alpha=0.7)
# Scatter plot for test data
plt.scatter(x_test, y_test, label='Test', alpha=0.7)
# Regression line
plt.plot(x_test, y_pred, label='Regression', color='black')
# Add labels and title
plt.xlabel('Total Number of Graduates and Post Graduates')
plt.ylabel('Immunized Children Aged 0 to 3 Years')
plt.title('No of Children (0-3 yrs) with Immunization vs No of Grads')
# Display legend
plt.legend()
# Display the plot
plt.show()
# Let us try Polynomial Regression!
degree = 4 # You can adjust the degree of the polynomial
poly_features = PolynomialFeatures(degree=degree, include_bias=False) # Set include_bias=False to avoid constant term
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]
# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)
# Do the poly regression yeah
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)
# Generate a DataFrame for x_range_poly
x_range = pd.DataFrame(np.linspace(x_test.values.min(), x_test.values.max(), 100), columns=['total_grd_and_pg_in_village'])
x_range_poly = poly_features.transform(x_range)
x_range_poly_df = pd.DataFrame(x_range_poly, columns=poly_feature_names)
# Model making a prediction on x_range
y_pred = lm.predict(x_range_poly_df)
# Evaluation metrics
print('R2:', round(lm.score(x_test_poly_df, y_test), 4))
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
print('RMSE on training set:', round(rmse_train, 4))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
print('RMSE on test set:', round(rmse_test, 4))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
print('Difference in %:', round(difference_rmse, 3))
# Plot
plt.scatter(x_test.values, y_test, label='Test Data', alpha=0.7)
plt.scatter(x_train.values, y_train, label='Train Data', alpha=0.7)
plt.plot(x_range['total_grd_and_pg_in_village'], y_pred, color='black', label='Polynomial Regression (Degree {})'.format(degree))
plt.xlabel('Total Number of Graduates and Post Graduates')
plt.ylabel('Immunized Children Aged 0-3 Years')
plt.title('Polynomial Regression with Degree {}'.format(degree))
plt.legend()
plt.show()
R2: 0.402
RMSE on training set: 2638.7538
RMSE on test set: 2588.5381
Difference in %: 1.903
degrees = [1, 2, 3, 4, 5, 6]
rmse_train_values = []
rmse_test_values = []
difference_rmse_values = []
for degree in degrees:
poly_features = PolynomialFeatures(degree=degree, include_bias=False)
x_train_poly = poly_features.fit_transform(x_train)
x_test_poly = poly_features.transform(x_test)
# Polynomial feature names
feature_names = x_train.columns
poly_feature_names = [f"{feature_names[i]}^{j}" for i in range(len(feature_names)) for j in range(1, degree + 1)]
# Assign polynomial feature names to transformed data
x_train_poly_df = pd.DataFrame(x_train_poly, columns=poly_feature_names)
x_test_poly_df = pd.DataFrame(x_test_poly, columns=poly_feature_names)
# Polynomial regression
lm = LinearRegression()
lm.fit(x_train_poly_df, y_train)
# Evaluation metrics
rmse_train = np.sqrt(metrics.mean_squared_error(y_train, lm.predict(x_train_poly_df)))
rmse_test = np.sqrt(metrics.mean_squared_error(y_test, lm.predict(x_test_poly_df)))
difference_rmse = abs(rmse_test - rmse_train) / rmse_train * 100
# Append RMSE values
rmse_train_values.append(rmse_train)
rmse_test_values.append(rmse_test)
difference_rmse_values.append(difference_rmse)
# Plotting
plt.figure(figsize=(15, 5))
# Plot 1: RMSE Train
plt.subplot(1, 3, 1)
plt.plot(degrees, rmse_train_values, marker='o')
plt.title('RMSE Train vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Train')
plt.grid(True)
# Plot 2: RMSE Test
plt.subplot(1, 3, 2)
plt.plot(degrees, rmse_test_values, marker='o')
plt.title('RMSE Test vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('RMSE Test')
plt.grid(True)
# Plot 3: Difference RMSE
plt.subplot(1, 3, 3)
plt.plot(degrees, difference_rmse_values, marker='o')
plt.title('Difference in RMSE (Test-Train) vs Polynomial Degree')
plt.xlabel('Polynomial Degree')
plt.ylabel('Difference in RMSE (%)')
plt.grid(True)
plt.tight_layout()
plt.show()
This relationship is also better modeled with a relatively low degree; degree 4, for example, gives a low absolute RMSE.
# Decision Tree Regressor
depths = [1, 2, 3, 4, 5, 6, 7, 8, 9, 11] # Try different depths
train_rmse = []
test_rmse = []
for depth in depths:
# Create and train the model
dt_regressor = DecisionTreeRegressor(max_depth=depth)
dt_regressor.fit(x_train, y_train)
# Predictions
y_train_pred = dt_regressor.predict(x_train)
y_test_pred = dt_regressor.predict(x_test)
# Calculate RMSE (Root Mean Squared Error)
train_rmse.append(mean_squared_error(y_train, y_train_pred, squared=False))
test_rmse.append(mean_squared_error(y_test, y_test_pred, squared=False))
# Plotting the results
plt.figure(figsize=(8, 4))
plt.plot(depths, train_rmse, label='Training')
plt.plot(depths, test_rmse, label='Test')
plt.xlabel('Tree Depth')
plt.ylabel('RMSE')
plt.title('Decision Tree Regressor - Model Complexity')
plt.legend()
plt.show()
Reflection
Technicality: A. Multivariate Regression
Thus, together with the correlations and p-values: H1, H2, H3, and H5 are supported.
B. Linear Regression
How to improve?
C. Graphical Excellence
In this part, main activities are:
This part mainly tests the final hypothesis, H6.
H6: the level of immunized children (aged 0-3 years) is affected by neighbouring areas that also have a certain level of immunization. The immunization practices of one area (subdistrict) may influence the immunization behavior of nearby areas. This comes from the idea that communities with higher immunization rates create a positive norm that influences neighboring communities.
Main idea: to check how an area correlates with its neighbours in terms of the "percentage of immunized children (0-3 years)", we will run both Global and Local Autocorrelation, using the Moran plot. We will also examine some regression parameters in this case!
1. Spatial Visualization
Let us see a glimpse of what India looks like!
# Load local authority geometries using the pc11_subdistrict_id code as index
import contextily as ctx  # basemap tiles (not imported in the setup cell above)
map = gpd.read_file(map_path).set_index('pc11_subdistrict_id', drop=False)  # note: shadows the built-in map()
ax = map.plot(figsize=(6, 6), alpha=0.3, color='red')
# Add a background map, expressing the target CRS so the basemap can be
# reprojected (warped)
ctx.add_basemap(ax, crs=map.crs)
ax.set_title('Map of India')
Text(0.5, 1.0, 'Map of India')
Next, we will see how the children's immunization rate is distributed across the whole of India. It will be easier to understand with choropleths.
The choropleth I will make uses the "Quantiles" scheme. I chose this because, as seen in Part 02, most of the data is positively skewed, roughly resembling a Poisson distribution with a low lambda. Quantiles suit skewed distributions and help handle outliers: they give a good representation by placing an approximately equal number of observations in each interval.
import mapclassify  # classification schemes (not imported in the setup cell above)
num_division = 11  # number of quantile classes
geo_pop = dfreg_encoded
geo_pop = geo_pop.rename(columns={'immunized_child_pct': 'percent_immunized_child'})
classi = mapclassify.Quantiles(geo_pop['percent_immunized_child'], k=num_division)
classi
Quantiles
Interval Count
------------------------
[ 3.40, 47.09] | 467
( 47.09, 57.00] | 475
( 57.00, 63.50] | 464
( 63.50, 68.30] | 465
( 68.30, 72.60] | 469
( 72.60, 76.30] | 469
( 76.30, 80.10] | 465
( 80.10, 83.70] | 467
( 83.70, 87.60] | 473
( 87.60, 92.40] | 461
( 92.40, 100.00] | 462
# Set up the figure to explain the Quantiles that I made
f, ax = plt.subplots(1, figsize=(8, 5))
# Plot the kernel density estimation (KDE)
sns.kdeplot(geo_pop['percent_immunized_child'], fill=True)
# Add a blue tick for every value at the bottom of the plot (rugs)
sns.rugplot(geo_pop['percent_immunized_child'], alpha=0.5)
# Loop over each break point and plot a vertical red line
for cut in classi.bins:
plt.axvline(cut, color='red', linewidth=0.75)
# Display image
ax.set_title('KDE for Immunized Children (0-3) Variable')
plt.show()
Graph Excellence
# Create Choropleths using the 'cividis' colormap
geo_pop.plot(column='percent_immunized_child', scheme='QUANTILES', alpha=1, k=num_division, \
cmap='cividis', # Use cividis colormap
edgecolor='w', linewidth=0.1, figsize=(8, 8))
# Add title with proper centering
plt.title('Choropleths of Immunized Children (0-3) in India, per Subdistricts', pad=20, loc='center')
# Add colorbar with proper positioning
cax = plt.axes([0.95, 0.1, 0.03, 0.8]) # [x, y, width, height]
# Use a name other than 'sm' so we do not shadow the statsmodels alias
mappable = plt.cm.ScalarMappable(cmap='cividis', norm=plt.Normalize(vmin=geo_pop['percent_immunized_child'].min(), vmax=geo_pop['percent_immunized_child'].max()))
mappable._A = []  # empty array for the data range
cbar = plt.colorbar(mappable, cax=cax)
# Set colorbar label
cbar.set_label('Percentage of children aged 0-3 years immunized')
# Show the plot
plt.show()
Graph Excellence
Next, we will make a Moran plot: a plot of the spatial data against its spatially lagged values, augmented by summary measures of the linear relationship between the data and the lag. Basically, it is linear regression!
The Moran plot is a basic component of both global and local autocorrelation analysis.
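The "it is linear regression" remark can be made precise with a toy numpy sketch (hypothetical values for three regions in a line): with a row-standardized weight matrix, the OLS slope of the spatial lag on the centered variable equals Moran's I.

```python
import numpy as np

# Row-standardized weights for three regions in a line (0-1 and 1-2 are neighbours)
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
y = np.array([10.0, 20.0, 60.0])  # hypothetical attribute values
z = y - y.mean()                  # center the variable
lag = W @ z                       # spatial lag of the centered variable

slope = (z @ lag) / (z @ z)                          # OLS slope through the origin
moran_I = len(z) / W.sum() * (z @ W @ z) / (z @ z)   # Moran's I definition
print(slope, moran_I)  # the two numbers coincide for row-standardized W
```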
Please note that from here we move the analysis "up" to the district level because, as explained before, the subdistrict gpkg produces many undesired "islands" when creating the spatial weights. Let us refresh our memory with the district-level dataframe: geo_pop_district.
geo_pop_district.head(3)
| pc11_state_id | pc11_district_id | district_name | geometry | female_population | male_population | total_childs_aged_0_to_3_years | total_childs_aged_0_to_3_years_i | total_grd_and_pg_in_village | total_hhd | total_no_of_beneficiaries_receiv | total_no_of_elect_rep_oriented_u | total_no_of_eligible_beneficiari | total_no_of_lactating_mothers | total_no_of_lactating_mothers_re | total_population | total_shg | total_trained_trainee_under_skil | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | 468.0 | Kachchh | MULTIPOLYGON (((70.45008 23.01226, 70.44904 23... | 6.918732e+05 | 7.659721e+05 | 52789.062307 | 36124.140136 | 29509.874064 | 332446.747616 | 3933.575088 | 2623.179898 | 4593.834795 | 7194.496554 | 6452.874545 | 1.465846e+06 | 5262.200653 | 9285.792694 |
| 1 | 24 | 469.0 | Banas Kantha | MULTIPOLYGON (((71.24964 24.20926, 71.24207 24... | 1.490841e+06 | 1.603543e+06 | 137018.185689 | 93865.516320 | 103408.588403 | 607873.050328 | 9331.840111 | 8527.665757 | 11595.960114 | 19340.773617 | 16878.197941 | 3.273774e+06 | 13546.601715 | 24814.951196 |
| 2 | 24 | 470.0 | Patan | MULTIPOLYGON (((71.42507 23.96967, 71.42497 23... | 5.204153e+05 | 5.651527e+05 | 49457.172519 | 30594.736482 | 64222.702638 | 258319.762048 | 1643.088208 | 3397.482485 | 2437.191204 | 9340.013570 | 7797.987739 | 1.089937e+06 | 14351.448633 | 30910.742259 |
Do not forget to calculate the variables needed for the hypotheses!
Here, finally, we can use a per-capita rate of immunization among children, because we will compare districts across India. A reminder that this choice is based on: Uslaner, E. M. (1976). The Pitfalls of Per Capita. American Journal of Political Science, 20(1), 125–133. https://doi.org/10.2307/2110513
geo_pop_district['percent_immunized_child'] = (geo_pop_district['total_childs_aged_0_to_3_years_i'] / geo_pop_district['total_childs_aged_0_to_3_years']).round(3) * 100
Autocorrelation Preparation
a. Creating Spatial Weights
Before doing ESDA at both the global and local level, we need to create a spatial weights matrix. Here we will use "queen" contiguity, which defines neighbours as units sharing a common boundary or vertex. Queen contiguity is used because it is less sensitive to irregularly shaped regions in the spatial distribution of the units.
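For intuition, here is a pure-Python sketch of queen contiguity on a regular 3x3 grid: two cells are neighbours if they share an edge or a corner. This toy grid is only an illustration; the real weights below come from the district polygons.

```python
def queen_neighbors(r, c, nrows, ncols):
    """Queen-contiguity neighbours of cell (r, c): share an edge or a corner."""
    return [(rr, cc)
            for rr in range(max(0, r - 1), min(nrows, r + 2))
            for cc in range(max(0, c - 1), min(ncols, c + 2))
            if (rr, cc) != (r, c)]

# On a 3x3 grid the centre touches all 8 cells, a corner only 3
print(len(queen_neighbors(1, 1, 3, 3)))  # 8
print(len(queen_neighbors(0, 0, 3, 3)))  # 3
```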
variables = 'pc11_district_id'
weight_df = geo_pop_district
# Check for duplicate entries
if weight_df[variables].duplicated().any():
# If duplicates exist, drop them or handle them as per your data requirements
weight_df = weight_df.drop_duplicates(subset=variables)
# Set the index
weight_df = weight_df.set_index(variables, drop=False)
# Create spatial weights matrix
w_queen = Queen.from_dataframe(weight_df, ids=variables)
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 26 disconnected components. There are 6 islands with ids: 71.0, 88.0, 528.0, 638.0, 639.0, 640.0. warnings.warn(message)
As you can see above, the number of islands is quite low, so we can ignore them in further exploration. The total number of districts here is 604, which makes the 6 islands less than 1% of the data; dropping them will not materially affect the analysis.
w_queen.islands
[71.0, 88.0, 528.0, 638.0, 639.0, 640.0]
# Next, drop the islands (the index was already set to pc11_district_id above)
weight_df = weight_df.drop(w_queen.islands)
Once we have the set of districts that are not islands, we need to re-calculate the weights matrix:
w_queen = Queen.from_dataframe(weight_df, ids=variables)
w_queen.transform = 'R'
/Users/mochammadmiftahulfahmi/anaconda3/envs/install_gds_stack/lib/python3.9/site-packages/libpysal/weights/weights.py:224: UserWarning: The weights matrix is not fully connected: There are 20 disconnected components. warnings.warn(message)
Note that a warning about disconnected components remains even after dropping the islands: there are still 20 disconnected components, but no isolated observations.
Also check the plot to see how the numbers of neighbours are spread. As we can see in the graph below, the average number of neighbours per district is around 4.
queen_card = pd.Series(w_queen.cardinalities)
sns.displot(queen_card, bins=10)
plt.title('Distribution of Cardinalities for Queen Contiguity')
Text(0.5, 1.0, 'Distribution of Cardinalities for Queen Contiguity')
b. Spatial Lag
The next step is to compute the spatial lag.
We will create some new variables:
weight_df['w_percent_immunized_child'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child'])
# Check whether it works or not :D
weight_df[['pc11_district_id', 'district_name','percent_immunized_child', 'w_percent_immunized_child']].head(5)
| pc11_district_id | district_name | percent_immunized_child | w_percent_immunized_child | |
|---|---|---|---|---|
| pc11_district_id | ||||
| 468.0 | 468.0 | Kachchh | 68.4 | 69.75 |
| 469.0 | 469.0 | Banas Kantha | 68.5 | 65.15 |
| 470.0 | 470.0 | Patan | 61.9 | 68.45 |
| 471.0 | 471.0 | Mahesana | 82.7 | 72.00 |
| 472.0 | 472.0 | Sabar Kantha | 76.1 | 72.60 |
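The spatial lag shown in the table is simply a neighbourhood-weighted average. A minimal numpy sketch with hypothetical values for three districts in a line:

```python
import numpy as np

# Binary contiguity: districts 0-1 and 1-2 share a border
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
W_row = W / W.sum(axis=1, keepdims=True)  # row-standardize, as w_queen.transform = 'R' does
pct = np.array([60.0, 70.0, 90.0])        # hypothetical immunization percentages
lag = W_row @ pct                          # each entry = mean of that district's neighbours
print(lag)  # [70. 75. 70.]
```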
Next, create the standardized variables (z-scores):
weight_df['percent_immunized_child_std'] = (weight_df['percent_immunized_child'] - weight_df['percent_immunized_child'].mean()) / weight_df['percent_immunized_child'].std()
weight_df['w_percent_immunized_child_std'] = weights.lag_spatial(w_queen, weight_df['percent_immunized_child_std'])
c. Moran Plot Regression
Finally, we will see how the w_percent_immunized_child_std correlates with percent_immunized_child_std
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values with regression line and R-squared value
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None, ax=ax,
scatter_kws={'s': 10, 'alpha': 0.7}, line_kws={'color': 'black'})
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
# Display R-squared value
corr_value = weight_df['percent_immunized_child_std'].corr(weight_df['w_percent_immunized_child_std'])
r_squared = corr_value**2
plt.text(0.5, 0.1, f'R-squared = {r_squared:.2f}', ha='center', va='center', transform=ax.transAxes, bbox=dict(facecolor='white', edgecolor='white', boxstyle='round,pad=0.5'))
# Display
plt.title('Moran Plot of Standardized Childs Immunization Percentages')
plt.show()
We can see that the R^2 is moderate, and better than our previous model for the immunization rate, which only reached R^2 = 20%.
Thus, yes, neighbours affect the tendency to have a higher percentage of immunized children.
In the context of the percentage of immunized children, the Moran plot can be interpreted as follows: districts display positive spatial autocorrelation in how they provide immunization to children. This means that districts with a high percentage of immunization tend to be located near other districts with high immunization percentages, and vice versa.
Other than R^2, which explains how much of the variance the model accounts for, we need to evaluate how well the regression fits unseen data.
Thus, as we did earlier in the story, we will evaluate the fitted model on a test data set.
Here, we will use a 70-30 train-test split (about as practical as the common 80-20 split).
# Split the data into training and test sets
from sklearn.linear_model import LinearRegression  # not in the imports above
from sklearn.metrics import mean_squared_error     # needed for RMSE below

X = weight_df[['percent_immunized_child_std']]
y = weight_df['w_percent_immunized_child_std']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit a linear regression model on the training data
model = LinearRegression()
model.fit(X_train, y_train)
# Predict on both splits
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# R-squared (note: computed on the full sample, not the test set)
corr_value = weight_df['percent_immunized_child_std'].corr(weight_df['w_percent_immunized_child_std'])
r_squared = corr_value**2
# RMSE on train and test
rmse_train = np.sqrt(mean_squared_error(y_train, y_train_pred))
rmse_test = np.sqrt(mean_squared_error(y_test, y_test_pred))
# Print R-squared value and RMSE
print(f'R-squared = {r_squared:.2f}')
print(f'RMSE (Train) = {rmse_train:.2f}')
print(f'RMSE (Test) = {rmse_test:.2f}')
R-squared = 0.46 RMSE (Train) = 0.58 RMSE (Test) = 0.59
We can see above that the train and test RMSE barely differ, which suggests the regression line generalizes well to unseen data rather than overfitting the sample.
d. Moran I (Global Autocorrelation)
In order to calculate Moran's I in our dataset, we can call a specific function in PySAL directly:
mi = esda.Moran(weight_df['percent_immunized_child'], w_queen)
print("Moran I value = ")
mi.I
Moran I value =
0.5382834157404031
moran_scatterplot(mi);
e. Local Spatial autocorrelation
Moran's I provides an indication of overall clustering but doesn't pinpoint the specific locations of clusters.
To identify the spatial distribution at a finer level, local measures of spatial autocorrelation are crucial. Unlike global measures that analyze the entire dataset, local measures operate on individual observations, offering detailed insights instead of summarizing the entire map.
# Setup the figure and axis
f, ax = plt.subplots(1, figsize=(6, 6))
# Plot values
sns.regplot(x='percent_immunized_child_std', y='w_percent_immunized_child_std', data=weight_df, ci=None)
# Add vertical and horizontal lines
plt.axvline(0, c='k', alpha=0.5)
plt.axhline(0, c='k', alpha=0.5)
plt.text(1.75, 0.5, "HH", fontsize=25)
plt.text(1.5, -1.5, "HL", fontsize=25)
plt.text(-2, 1, "LH", fontsize=25)
plt.text(-1.5, -2.5, "LL", fontsize=25)
# Display
plt.show()
Next, we identify cases in which the comparison between the value of an observation and the average of its neighbours is either more similar (HH, LL) or more dissimilar (HL, LH) than we would expect from pure chance. In order to do so, we need LISA (Local Indicators of Spatial Association).
lisa = esda.Moran_Local(weight_df['percent_immunized_child'], w_queen)
All we need to pass is the variable of interest (the percentage of immunized children) and the spatial weights that describe the neighbourhood relations between the observations that make up the dataset.
Next, to make the map making more straightforward, it is convenient to pull them out and insert them in the main data table (weight_df)
# Break observations into significant or not
weight_df['significant'] = lisa.p_sim < 0.05
# Store the quadrant they belong to
weight_df['quadrant'] = lisa.q
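p_sim is a pseudo p-value obtained by random permutations of the data. Below is a simplified sketch of the idea for the global statistic, assuming a dense numpy weight matrix; esda's local version uses a more careful conditional permutation scheme.

```python
import numpy as np

def moran_I(y, W):
    """Global Moran's I for a dense weight matrix W."""
    z = y - y.mean()
    return len(y) / W.sum() * (z @ W @ z) / (z @ z)

def moran_p_sim(y, W, n_perm=999, seed=42):
    """One-sided pseudo p-value: how often does a random shuffle of y
    produce a Moran's I at least as large as the observed one?"""
    rng = np.random.default_rng(seed)
    obs = moran_I(y, W)
    hits = sum(moran_I(rng.permutation(y), W) >= obs for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)
```

For strongly clustered data (for example, low values on one side of the map and high values on the other) the pseudo p-value comes out small, mirroring the significance flags stored above.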
Next, inspect the significance flags in the dataframe, followed by the quadrants.
weight_df['significant'].head(20)
pc11_district_id
468.0    False
469.0    False
470.0    False
471.0    False
472.0    False
473.0    False
474.0    False
475.0    False
476.0    False
477.0    False
478.0    False
479.0    False
480.0    False
481.0    False
482.0    False
483.0    False
484.0    False
485.0    False
486.0    False
487.0     True
Name: significant, dtype: bool
weight_df['quadrant'].head()
pc11_district_id
468.0    3
469.0    3
470.0    3
471.0    1
472.0    1
Name: quadrant, dtype: int64
The correspondence between the numbers in the variable and the actual quadrants is as follows:
HH (High-High): Explanation: Districts in the High-High quadrant have high immunization percentages and are surrounded by other districts with high immunization percentages. Conclusion: This indicates spatial clustering of high immunization rates, suggesting a positive spatial autocorrelation. High-performing districts are geographically close to other high-performing districts.
LL (Low-Low): Explanation: Districts in the Low-Low quadrant have low immunization percentages and are surrounded by other districts with low immunization percentages. Conclusion: This also reflects positive spatial autocorrelation (clustering of similar values), only of low values: districts with low immunization rates tend to be close to other districts with similarly low rates.
LH (Low-High): Explanation: Districts in the Low-High quadrant have low immunization percentages but are surrounded by districts with high immunization percentages. Conclusion: This pattern indicates spatial outliers where districts with low immunization rates are in close proximity to districts with high rates. It may be an area of concern or a unique spatial pattern.
HL (High-Low): Explanation: Districts in the High-Low quadrant have high immunization percentages but are surrounded by districts with low immunization percentages. Conclusion: Similar to LH, this pattern represents spatial outliers where districts with high immunization rates are in close proximity to districts with low rates. This, too, could be an area of concern or a unique spatial pattern.
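The four quadrant labels above boil down to a sign check on the standardized value and its spatial lag; esda encodes them in q as 1=HH, 2=LH, 3=LL, 4=HL. A minimal sketch (the handling of values exactly at zero is a convention):

```python
def lisa_quadrant(z, z_lag):
    """Map a standardized value and its spatial lag to a Moran quadrant
    using esda's encoding: 1=HH, 2=LH, 3=LL, 4=HL."""
    if z >= 0 and z_lag >= 0:
        return 1  # HH: high value, high-valued neighbours
    if z < 0 and z_lag >= 0:
        return 2  # LH: low value surrounded by high values
    if z < 0 and z_lag < 0:
        return 3  # LL: low value, low-valued neighbours
    return 4      # HL: high value surrounded by low values
```

For example, the districts shown earlier with quadrant = 3 have both their own standardized percentage and their neighbours' average below the overall mean.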
Next, with these two columns, significant and quadrant, we can build a typical LISA cluster map:
plot_local_autocorrelation(lisa, weight_df, 'percent_immunized_child');
Reflection
Regarding H6:
if mi.p_sim < 0.05:
print("Spatial autocorrelation is statistically significant.")
if mi.I > 0:
print("Positive spatial autocorrelation (HH or LL).")
else:
print("Negative spatial autocorrelation (LH or HL).")
else:
print("No statistically significant spatial autocorrelation detected.")
Spatial autocorrelation is statistically significant. Positive spatial autocorrelation (HH or LL).
After this long story, I would like to thank Mr. Trivik and the TAs for answering my questions in the lab, and of course for reading my story here.
Conclusion
Some hypotheses in this preliminary research have been tested, but all of them need to be evaluated with expertise, such as Trivik's, to fully remove possible biases. So far:
Reflection using Critical Data Science Framework
Important notes for users of this preliminary research, in the context of "Communicate Findings: transparency and accessibility of the results".